Preprint / Version 1

Similar patent search using Contrastive Learning

Authors

  • Yuki Hoshino Tokyo Institute of Technology, School of Engineering
  • Yoshimasa Utsumi Rakuten Group, Inc. Intellectual Property Department
  • Kazuhide Nakata Tokyo Institute of Technology, School of Engineering

DOI:

https://doi.org/10.51094/jxiv.344

Keywords:

Patent, Contrastive Learning, Natural Language Processing

Abstract

In recent years, the management of intellectual property has become increasingly important to society. In particular, more than 300,000 patent applications are filed every year, and handling such a huge volume of patents poses many challenges. In this study, we therefore apply Contrastive Learning to similar patent search, a task of central importance in patent practice. However, it is not obvious which parts of a patent document should be used as input, and no previous work has examined what should serve as the supervision signal when Contrastive Learning is applied to patents. This paper therefore performs similar patent search with three key techniques. First, as the input, we propose using the full text of the claims, building both the tokenizer and the encoder from scratch. Second, we propose using citation information as the supervision signal for Contrastive Learning. Finally, we propose an IPC-based method for constructing hard negatives for Contrastive Learning. We then evaluate the method on real patent data in two ways: numerical experiments using the citation information produced during patent examination, which verify the effectiveness of the proposed method, and case studies of several actual invalidation requests, which verify how the method behaves in practice.
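The abstract describes the training objective only at a high level: pairs of citing and cited patents serve as positives, while uncited patents in the same IPC class serve as hard negatives. As a rough illustration, the sketch below shows one plausible form of such an objective, a SimCSE-style InfoNCE loss in PyTorch; all function and variable names here are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(anchor, positive, hard_negative, temperature=0.05):
    """SimCSE-style InfoNCE loss with hard negatives (illustrative sketch).

    anchor:        (batch, dim) embeddings of query patents (full claim text)
    positive:      (batch, dim) embeddings of patents cited by each anchor
    hard_negative: (batch, dim) embeddings of uncited patents sharing the
                   anchor's IPC class -- the paper's proposed hard negatives
    """
    # Similarity of every anchor to every positive; off-diagonal entries
    # act as in-batch negatives.
    sim_pos = F.cosine_similarity(anchor.unsqueeze(1), positive.unsqueeze(0), dim=-1)
    # Similarity of every anchor to every IPC-matched hard negative.
    sim_neg = F.cosine_similarity(anchor.unsqueeze(1), hard_negative.unsqueeze(0), dim=-1)
    logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature
    # For anchor i, the correct "class" is its own cited patent, column i.
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)


# Toy usage with random embeddings standing in for encoder outputs.
B, D = 8, 768
loss = contrastive_loss(
    torch.randn(B, D, requires_grad=True),
    torch.randn(B, D),
    torch.randn(B, D),
)
loss.backward()  # in real training this would update the claim-text encoder
```

Under this kind of objective, each anchor is pulled toward its cited patent and pushed away from both in-batch negatives and the IPC-matched hard negatives; the temperature controls how sharply near misses are penalized.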

Conflicts of Interest Disclosure

Supported by Rakuten Group, Inc.



Posted

Submitted: 2023-03-29 06:27:44 UTC

Published: 2023-03-31 10:52:20 UTC

Section

Information Sciences