MaterialBERT for Natural Language Processing of Materials Science Texts

DOI: https://doi.org/10.51094/jxiv.119
Published version DOI: 10.1080/27660400.2022.2124831

Keywords: word embedding, pre-training, BERT, literal information

Abstract
A BERT (Bidirectional Encoder Representations from Transformers) model, which we named "MaterialBERT," was generated using scientific papers from a wide range of materials science fields as a corpus. A new vocabulary list for the tokenizer was generated from this materials science corpus. Two BERT models with different tokenizer vocabulary lists were generated: one with the original vocabulary made by Google and the other with the vocabulary newly made by the authors. Word vectors embedded during pre-training with the two MaterialBERT models reasonably reflect the meanings of material names, both in material-class clustering and in the relationship between base materials and their compounds or derivatives, not only for inorganic materials but also for organic materials and organometallic compounds. Fine-tuning on CoLA (The Corpus of Linguistic Acceptability) using the pre-trained MaterialBERT yielded a higher score than the original BERT.
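As a rough illustration of how such embedded word vectors can be inspected, the sketch below extracts a vector for a material name from a pre-trained BERT-style checkpoint using the huggingface/transformers library (listed in the references below) and compares two polymer names by cosine similarity. The checkpoint path is a placeholder rather than a published model identifier, and averaging subword vectors is a common convention, not necessarily the paper's exact procedure.

```python
import torch
from transformers import BertModel, BertTokenizer

# "path/to/materialbert" is a placeholder for a local checkpoint directory.
tokenizer = BertTokenizer.from_pretrained("path/to/materialbert")
model = BertModel.from_pretrained("path/to/materialbert")
model.eval()

def word_vector(word: str) -> torch.Tensor:
    """Average the last-hidden-state vectors of the word's subword tokens."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Position 0 is [CLS] and the last position is [SEP]; average the rest.
    return outputs.last_hidden_state[0, 1:-1].mean(dim=0)

# Cosine similarity between material names is one simple basis for
# material-class clustering.
a = word_vector("polyethylene")
b = word_vector("polypropylene")
print(torch.nn.functional.cosine_similarity(a, b, dim=0).item())
```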
References
List of universities offering degrees in “business informatics”. Available from: https://en.everybodywiki.com/List_of_universities_offering_degrees_in_business_informatics
BizNews. What is business informatics? Available from: https://biznewske.com/what-is-business-informatics/
A journal with the title "industrial informatics": IEEE Transactions on Industrial Informatics. ISSN: 1551-3203.
Ramprasad R, Batra R, Pilania G, et al. Machine learning in materials informatics: recent applications and prospects. npj Comput Mater. 2017;3:54 (1-13). https://doi.org/10.1038/s41524-017-0056-5
Agrawal A, Choudhary A. Perspective: Materials informatics and big data: Realization of the "fourth paradigm" of science in materials science. APL Materials. 2016;4:053208 (1-10). https://doi.org/10.1063/1.4946894
Tanaka F, Sato H, Yoshii N, et al. Materials Informatics for Process and Material Co-Optimization. IEEE Transactions on Semiconductor Manufacturing. 2019;32:444-449.
Hassan AU, Hussain J, Hussain M, et al. Sentiment analysis of social networking sites (SNS) data using machine learning approach for the measurement of depression. Proceedings of 2017 International Conference on Information and Communication Technology Convergence (ICTC); 2017 Oct 18-20; Jeju, South Korea. IEEE; 2017.
Yoshida S, Kitazono J, Ozawa S, et al. Sentiment analysis for various SNS media using Naïve Bayes classifier and its application to flaming detection. Proceedings of 2014 IEEE Symposium on Computational Intelligence in Big Data (CIBD); 2014 Dec 9-12; Orlando, FL, USA. IEEE; 2015.
Ahn H, Lee S. An Analytic Study on Private SNS for Bonding Social Networking. In: Meiselwitz G, editor. Social Computing and Social Media. SCSM 2015. Lecture Notes in Computer Science, vol 9182. Cham: Springer; 2015. https://doi.org/10.1007/978-3-319-20367-6_12
Khairi SSM, Ghani RAM. Analysis of social networking sites on academic performance among university students: A PLS-SEM approach. AIP Conference Proceedings. 2019;2138:050015. Available from: https://doi.org/10.1063/1.5121120
Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26 (NIPS 2013). Available from: https://papers.nips.cc/paper/2013.
Tshitoyan V, Dagdelen J, Weston L, et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature. 2019;571:95–98.
Long short-term memory. Available from: https://en.wikipedia.org/wiki/Long_short-term_memory
Devlin J, Chang MW, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. Available from: https://arxiv.org/pdf/1810.04805.pdf.
Wang A, Singh A, Michael J, et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; 2018. p. 353–355. Available from: https://aclanthology.org/W18-5446, https://gluebenchmark.com/
Zhu Y, Kiros R, Zemel R, et al. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV); 2015 Dec 7-13; Santiago, Chile. IEEE; 2015. p. 19–27.
Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining; 2019 Oct 31. arXiv:1901.08746. Available from: https://arxiv.org/abs/1901.08746
Rasmy L, Xiang Y, Xie Z, et al. Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction; 2020 May 22. arXiv:2005.12833v1. Available from: https://doi.org/10.48550/arXiv.2005.12833
Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019. p. 3615–3620. DOI: 10.18653/v1/D19-1371. Available from: https://arxiv.org/abs/1903.10676
BERT Japanese Pretrained Model. Available from: https://nlp.ist.i.kyoto-u.ac.jp/?ku_bert_japanese, https://laboro.ai/activity/column/engineer/laboro-bert/.
Pretrained Japanese BERT models. Available from: https://github.com/cl-tohoku/bert-japanese
Yang Y, Uy MCS, Huang A. FinBERT: A pretrained language model for financial communications. Available from: https://arxiv.org/abs/2006.08097
Chalkidis I, Fergadiotis M, Malakasiotis P, et al. LEGAL-BERT: The Muppets straight out of Law School. arXiv:2010.02559v1, 2020 Oct 6. Available from: https://doi.org/10.48550/arXiv.2010.02559
Yoshitake M, Kuwajima I, Yagyu S, et al. System for Searching Relationship among Physical Properties for Materials Curation™. Vac. Surf. Sci. 2018;61:200–205.
Yoshitake M. Tool for Designing Breakthrough Discovery in Materials Science. Materials. 2021;14:6946 (1-15). Available from: https://doi.org/10.3390/ma14226946
Yoshitake M, Sato F, Kawano H, et al. MaterialBERT for Natural Language Processing of Materials Science Texts. Paper presented at: 68th JSAP Spring Meeting; 2021 Mar 16-19; Online.
Gupta T, Zaki M, Krishnan NMA, et al. MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction. arXiv:2109.15290v1, 2021 Sep 30. Available from: https://doi.org/10.48550/arXiv.2109.15290
Walker N, Trewartha A, Huo H, et al. The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Available from SSRN: https://ssrn.com/abstract=3950755 or http://dx.doi.org/10.2139/ssrn.3950755
Oka H, Ishii M. Sentence classification for polymer data extraction from scientific articles. Poster session presented at: 69th JSAP Spring Meeting; 2022 Mar 22-26; Sagamihara, Kanagawa.
When one downloads BERT-Base from https://github.com/google-research/bert, a vocab.txt file is included in the zip archive (a tokenizer-loading sketch is given after this reference list).
Kudo T, Richardson J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; 2018 Oct 31-Nov 4; Brussels, Belgium. Association for Computational Linguistics. p. 66-71. Available from: https://github.com/google/sentencepiece. (A vocabulary-generation sketch is given after this reference list.)
Classes of Materials. University of Cambridge. Available from: https://www.doitpoms.ac.uk/tlplib/artefact/classes.php
Luscombe C. Introduction to Materials Science and Engineering. University of Washington, USA. Available from: http://courses.washington.edu/mse170/powerpoint/luscombe/Week1complete.pdf
"Semiconductor" is a relatively new class of materials, as mentioned in Materials science (https://en.wikipedia.org/wiki/Materials_science) and in Materials science and engineering (https://en.wikiversity.org/wiki/Portal:Materials_science_and_engineering).
Warstadt A, Singh A, Bowman SR. Neural network acceptability judgments. Available from: https://arxiv.org/abs/1805.12471, https://nyu-mll.github.io/CoLA/. (A CoLA fine-tuning sketch is given after this reference list.)
Weston L, Tshitoyan V, Dagdelen J, et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. 2019;59:3692–3702.
Friedrich A, Adel H, Tomazic F, et al. The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; 2020; Online. Association for Computational Linguistics. p. 1255–1268.
ProsusAI/finBERT. Available from: https://github.com/ProsusAI/finBERT
huggingface/transformers. Available from: https://github.com/huggingface/transformers
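The vocab.txt note above can be made concrete with a minimal sketch, assuming the huggingface/transformers library listed just above. The file name is a placeholder for whichever vocabulary file accompanies a checkpoint, whether Google's original list or a newly generated materials-science one.

```python
from transformers import BertTokenizer

# Point the WordPiece tokenizer at a vocab.txt file; "vocab.txt" is a
# placeholder path, e.g. the file shipped in the BERT-Base zip archive.
tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=True)

# With a general-English vocabulary a material name is typically split into
# many subword pieces; a domain-specific vocabulary keeps longer units.
print(tokenizer.tokenize("polytetrafluoroethylene"))
```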
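The SentencePiece reference points at one way a new tokenizer vocabulary could be generated from a materials science corpus. The sketch below uses the library's Python API; the corpus file name, vocabulary size, and model type are illustrative assumptions, not the paper's reported settings.

```python
import sentencepiece as spm

# Train a subword vocabulary from a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="materials_corpus.txt",  # hypothetical corpus file
    model_prefix="matsci",         # writes matsci.model and matsci.vocab
    vocab_size=30000,              # BERT-Base uses a vocabulary of about 30k tokens
    model_type="bpe",
)

# Tokenize a material name with the freshly trained vocabulary.
sp = spm.SentencePieceProcessor(model_file="matsci.model")
print(sp.encode("yttrium barium copper oxide", out_type=str))
```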
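Finally, the CoLA fine-tuning mentioned in the abstract can be sketched with the standard transformers Trainer loop. The checkpoint path is a placeholder and the hyperparameters are illustrative, not the paper's settings; CoLA is conventionally scored with the Matthews correlation coefficient.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import matthews_corrcoef
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ckpt = "path/to/materialbert"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

# CoLA labels each sentence as grammatically acceptable (1) or not (0).
cola = load_dataset("glue", "cola").map(
    lambda ex: tokenizer(ex["sentence"], truncation=True, max_length=128),
    batched=True,
)

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"matthews_corr": matthews_corrcoef(p.label_ids, preds)}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cola_out", num_train_epochs=3),
    train_dataset=cola["train"],
    eval_dataset=cola["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```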
Published
Submitted: 2022-08-08 12:09:52 UTC
Published: 2022-08-12 05:59:34 UTC
License
Copyright (c) 2022
Michiko Yoshitake
Fumitaka Sato
Hiroyuki Kawano
Hiroshi Teraoka
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.