Preprint / Version 1

Similar Patent Retrieval Using Contrastive Learning

Authors

  • Yuki Hoshino, School of Engineering, Tokyo Institute of Technology
  • Yoshimasa Utsumi, Intellectual Property Department, Rakuten Group, Inc.
  • Kazuhide Nakata, School of Engineering, Tokyo Institute of Technology

DOI:

https://doi.org/10.51094/jxiv.344

Keywords:

natural language processing, patents, contrastive learning

Abstract

In recent years, the management of intellectual property has become increasingly important to society. Patents in particular, with more than 300,000 applications filed each year, pose many challenges in processing such a large volume of documents. This study applies contrastive learning to similar patent retrieval, a task of central importance in patent practice. However, it is not obvious which part of a patent document should be used as input, and what supervision to use for contrastive learning has not yet been studied. This paper therefore carries out similar patent retrieval with three contributions. First, regarding the input, we propose feeding the full text of the claims, and we build both the tokenizer and the encoder from scratch. Second, as supervision for contrastive learning, we propose using citation information. Third, we propose a method for constructing hard negatives for contrastive learning based on IPC codes. We then conduct two evaluations on real patent data: a numerical experiment using the citations made during patent examination, and a case study on several patent invalidation trials to examine how the method performs in practical use.
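The training objective described above — contrastive learning with patents cited during examination as positives and uncited same-IPC patents as hard negatives — can be illustrated with a minimal InfoNCE-style loss. The sketch below is illustrative only: the function name, embedding shapes, and temperature are assumptions, not the authors' implementation.

```python
import numpy as np

def info_nce_loss(anchor, positive, hard_negatives, temperature=0.05):
    """Contrastive (InfoNCE) loss for one query patent.

    anchor:         embedding of the query patent's full claims, shape (d,)
    positive:       embedding of a patent cited during examination, shape (d,)
    hard_negatives: embeddings of uncited patents sharing the query's IPC
                    code, shape (k, d)
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Temperature-scaled similarities: positive first, then hard negatives.
    pos_sim = cos(anchor, positive) / temperature
    neg_sims = np.array([cos(anchor, n) for n in hard_negatives]) / temperature
    logits = np.concatenate(([pos_sim], neg_sims))

    # Numerically stable log-softmax; the loss is the negative log-probability
    # assigned to the cited (positive) patent.
    m = logits.max()
    log_probs = logits - m - np.log(np.exp(logits - m).sum())
    return -log_probs[0]
```

In training, the loss would be averaged over a batch of (query, cited patent) pairs, with the gradient pushing embeddings of citation-linked patents together and same-IPC non-citations apart.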

Conflict of Interest Disclosure

This work was supported by Rakuten Group, Inc.



Published


Submitted: 2023-03-29 06:27:44 UTC

Published: 2023-03-31 10:52:20 UTC
Research field
Information science