Preprint / Version 2

Preliminary Annotation for Constructing Japanese Entity Linking Corpus

DOI:

https://doi.org/10.51094/jxiv.492

Keywords:

natural language processing, named entity recognition, entity disambiguation, entity linking, corpus annotation

Abstract

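Entity linking is the task of associating linguistic expressions with entries in a knowledge base that represent real-world objects and concepts. Language resources for this task have been built mainly for English, and few are available for evaluating Japanese systems. In this study, toward constructing an annotated corpus suitable for evaluating Japanese entity linking systems, we established a design policy and annotation guidelines and carried out a small-scale annotation. This paper reports these policies and guidelines, together with the annotation process and the descriptive statistics and characteristics of the resulting data, and discusses future prospects.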

Conflicts of Interest Disclosure

The authors declare no conflicts of interest associated with this manuscript.

References

Rada Mihalcea and Andras Csomai. Wikify! Linking documents to encyclopedic knowledge. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, pp. 233--242, 2007.

Silviu Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 708--716, Prague, Czech Republic, June 2007.

Paul McNamee, Hoa Trang Dang, Heather Simpson, Patrick Schone, and Stephanie M. Strassel. An evaluation of technologies for knowledge base population. In Proceedings of the 7th International Conference on Language Resources and Evaluation, Valletta, Malta, May 2010.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 782--792, Edinburgh, Scotland, UK, July 2011.

Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 1375--1384, Portland, Oregon, USA, June 2011.

Giuseppe Rizzo, Marieke van Erp, Julien Plu, and Raphaël Troncy. Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge. In Proceedings of the 6th Workshop on 'Making Sense of Microposts' co-located with the 25th International World Wide Web Conference, pp. 50--59, 2016.

Xiao Ling, Sameer Singh, and Daniel S. Weld. Design challenges for entity linking. Transactions of the Association for Computational Linguistics, Vol. 3, pp. 315--328, 2015.

Marieke van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe Rizzo, and Joerg Waitelonis. Evaluating entity linking: An analysis of current benchmark datasets and a roadmap for doing a better job. In Proceedings of the 10th International Conference on Language Resources and Evaluation, pp. 4373--4379, Portorož, Slovenia, May 2016.

Michael Röder, Ricardo Usbeck, and Axel-Cyrille Ngonga Ngomo. GERBIL – Benchmarking named entity recognition and linking consistently. Semantic Web, Vol. 9, No. 5, pp. 605--625, January 2018.

Marcel Milich and Alan Akbik. ZELDA: A comprehensive benchmark for supervised entity disambiguation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2061--2072, Dubrovnik, Croatia, May 2023.

Davaajav Jargalsaikhan, Naoaki Okazaki, Koji Matsuda, and Kentaro Inui. Building a corpus for Japanese wikification with fine-grained entity classes. In Proceedings of the ACL 2016 Student Research Workshop, pp. 138--144, Berlin, Germany, August 2016.

Yugo Murawaki and Shinsuke Mori. Wikification for scriptio continua. In Proceedings of the 10th International Conference on Language Resources and Evaluation, pp. 1346--1351, Portorož, Slovenia, May 2016.

関根聡, 中山功太, 隅田飛鳥, 渋木英潔, 門脇一真, 三浦明波, 宇佐美佑, 安藤まや. 森羅タスクと森羅公開データ. 言語処理学会 第29回年次大会 発表論文集, 2023.

OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023.

Koji Matsuda, Akira Sasaki, Naoaki Okazaki, and Kentaro Inui. Geographical entity annotated corpus of Japanese microblogs. Journal of Information Processing, Vol. 25, pp. 121--130, 2017.

Shohei Higashiyama, Hiroki Ouchi, Hiroki Teranishi, Hiroyuki Otomo, Yusuke Ide, Aitaro Yamamoto, Hiroyuki Shindo, Yuki Matsuda, Shoko Wakamiya, Naoya Inoue, Ikuya Yamada, and Taro Watanabe. Arukikata travelogue dataset with geographic entity mention, coreference, and link annotation. arXiv:2305.13844, 2023.

Jan A. Botha, Zifei Shan, and Daniel Gillick. Entity linking in 100 languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 7833--7845, Online, November 2020.

David Kubeša and Milan Straka. DaMuEL: A large multilingual dataset for entity linking. arXiv:2306.09288, 2023.

橋本泰一, 中村俊一. 拡張固有表現タグ付きコーパスの構築-白書, 書籍, Yahoo!知恵袋コアデータ-. 言語処理学会 第16回年次大会 発表論文集, 2010.

Kikuo Maekawa, Makoto Yamazaki, Toshinobu Ogiso, Takehiko Maruyama, Hideki Ogura, Wakako Kashino, Hanae Koiso, Masaya Yamaguchi, Makiro Tanaka, and Yasuharu Den. Balanced corpus of contemporary written Japanese. Language Resources and Evaluation, Vol. 48, No. 2, pp. 345--371, 2014.

Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata. Extended named entity hierarchy. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, Canary Islands - Spain, May 2002.

瀬戸賢一, 宮畑一範, 小倉雅明. [例解]現代レトリック事典. 大修館書店, 2022.

Denny Vrandečić and Markus Krötzsch. Wikidata: A free collaborative knowledgebase. Communications of the ACM, Vol. 57, No. 10, pp. 78--85, September 2014.

松田寛, 大村舞, 浅原正幸. 短単位品詞の用法曖昧性解決と依存関係ラベリングの同時学習. 言語処理学会 第25回年次大会 発表論文集, 2019.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. brat: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 102--107, Avignon, France, April 2012.

Posted


Submitted: 2023-08-24 14:12:28 UTC

Published: 2023-08-29 09:55:37 UTC — Updated on 2023-08-30 08:23:42 UTC

Versions

Reason(s) for revision

Owing to a misunderstanding, Section 2 (page 2) contained an inaccurate description of the research by Sekine et al. [13]. After consulting with the authors, we corrected it as follows:

Before: これらタスクでは,システムで自動アノテーションされたデータが評価に使用されている.(In these tasks, data automatically annotated by systems is used for evaluation.)

After: 同タスクの評価用データとして,人手でアノテーションされたデータが使用されている.(Manually annotated data is used as the evaluation data for this task.)

Section
Information Sciences