MaterialBERT for Natural Language Processing of Materials Science Texts
Keywords: word embedding, pre-training, BERT, literal information
We have built a BERT (Bidirectional Encoder Representations from Transformers) model, named “MaterialBERT,” using scientific papers from a wide range of materials science fields as the pre-training corpus. A new vocabulary list for the tokenizer was also generated from this materials science corpus. Two MaterialBERT models were pre-trained with different tokenizer vocabularies: one with the original vocabulary released by Google and the other with the vocabulary newly created by the authors. The word vectors embedded during pre-training of both models reasonably reflect the meanings of material names, both in clustering by material class and in the relationships between base materials and their compounds or derivatives, not only for inorganic materials but also for organic materials and organometallic compounds. When fine-tuned on CoLA (the Corpus of Linguistic Acceptability), the pre-trained MaterialBERT achieved a higher score than the original BERT.
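To illustrate why the tokenizer vocabulary matters, the following minimal sketch applies the greedy longest-match-first subword splitting used by the original BERT tokenizer to two material names. The two toy vocabularies and the function name `wordpiece_tokenize` are hypothetical and chosen only for illustration; they are not the vocabularies or code used by the authors, and the sketch does not reproduce how the new vocabulary was built.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into subword pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabularies (assumptions, not taken from the paper).
general_vocab = {"poly", "##eth", "##yl", "##ene", "per", "##ovs", "##kit", "##e"}
materials_vocab = general_vocab | {"polyethylene", "perovskite"}

for word in ("polyethylene", "perovskite"):
    print(word)
    print("  general vocabulary:  ", wordpiece_tokenize(word, general_vocab))
    print("  materials vocabulary:", wordpiece_tokenize(word, materials_vocab))
```

With the general-purpose toy vocabulary each material name is broken into several fragments, whereas the toy materials vocabulary keeps it as a single token, so one embedding vector corresponds to one material name.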
Submitted: 2022-08-08 12:09:52 UTC
Published: 2022-08-12 05:59:34 UTC
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.