Preprint / Version 1

An Inquiry into the Inevitability of Hallucination as a Structural Problem of Large Language Models

Reconsidering the Relationship between Grammar Learning and Knowledge Acquisition

DOI: https://doi.org/10.51094/jxiv.1746

Keywords: Large Language Model, Natural Language Processing, Deep Learning, Hallucination

Abstract

With the widespread adoption of large language models (LLMs), industries of many kinds are being transformed; at the same time, the generation of non-factual responses, known as hallucination, has become a major concern. This study reexamines the learning mechanism of LLMs from a structural perspective and demonstrates that hallucination is an inevitable consequence of their architecture. In Experiment 1, we analyze how changing the role assigned in a prompt affects model outputs. In Experiment 2, we examine the generation of bibliographic information and classify the causes of hallucination into two distinct patterns. Based on both quantitative and qualitative analyses, we argue that LLMs do not retain universal knowledge; rather, they primarily learn grammatical patterns grounded in token dependencies, and knowledge is embedded only as an incidental byproduct of that learning. Finally, we theoretically discuss the structural constraint arising from the inseparability of grammar and knowledge, which renders hallucination an unavoidable property of language models.
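
To make the experimental setup concrete, the sketch below illustrates the kind of probe described in Experiments 1 and 2: the same bibliographic request is sent under several assigned roles, and the replies are collected for comparison. This is a minimal sketch only; the OpenAI Python SDK, the model name, the personas, and the question are illustrative assumptions, not the protocol actually used in the paper.

```python
# Minimal sketch of a role-variation probe in the spirit of Experiments 1 and 2.
# Assumptions (not taken from the paper): the OpenAI Python SDK, the model name,
# the personas, and the bibliographic question are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = (
    "List three peer-reviewed papers on hallucination in large language models, "
    "with authors, year, venue, and DOI."
)

PERSONAS = [
    "You are a helpful assistant.",
    "You are a research librarian who only cites sources you can verify.",
    "You are a senior machine-learning researcher writing a literature review.",
]


def ask(system_prompt: str, user_prompt: str) -> str:
    """Send one (system, user) message pair and return the model's text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0.7,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Collect one reply per assigned role so the outputs can be compared side by side.
    for persona in PERSONAS:
        print(f"--- role: {persona}")
        print(ask(persona, QUESTION))
        print()
```

Comparing the replies across roles, and verifying any returned citations against an external bibliographic source, is one way to surface the role sensitivity and bibliographic errors that the abstract describes.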

Conflicts of Interest Disclosure

The authors have no conflicts of interest to declare in relation to this study.

Posted

Submitted: 2025-10-16 08:35:51 UTC

Published: 2025-10-24 01:41:19 UTC

Section: Information Sciences