Portuguese word embeddings for the oil and gas industry: Development and evaluation
Introduction
Over the last decades, companies have accumulated huge amounts of data stored in unstructured textual format. Considerable potentially valuable information may be hidden within these growing volumes of documents, such as scientific articles, journals, technical reports, operation logs, and laboratory analyses. Correctly identifying and processing all these textual resources is crucial to support a wide range of industrial and academic applications (Ittoo et al., 2016). Given the intense competitiveness of this industrial environment, it has become economically vital for oil and gas (O&G) companies to fully leverage the information in their existing data sources in order to maximize their operational efficiency (Blinston and Blondelle, 2017; Lu et al., 2019).
Recent advances in natural language processing (NLP) with deep learning algorithms (LeCun et al., 2015; Goodfellow et al., 2016) have been successfully applied in several industrial settings, improving the efficiency of decision-making processes (Ittoo et al., 2016; Blinston and Blondelle, 2017; Young et al., 2018; Nooralahzadeh et al., 2018; Cordeiro et al., 2019). These algorithms take unstructured text as their basic input, so it is important to obtain suitable mathematical representations for the textual elements. Word embedding models have been used efficiently to provide such meaningful representations: unsupervised learning methods are applied to a text corpus to assign a dense n-dimensional vector to each word in a vocabulary. These models can encode semantic and syntactic similarities between words based on the contexts in which they occur (Mikolov et al., 2013a, 2013b; Hartmann et al., 2017). Word vector representations are among the most fundamental units in any NLP application, since their strong generalization capability allows machine learning algorithms to achieve better accuracy (Goldberg, 2016).
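As a minimal illustration of the idea, consider toy vectors for three words (the values below are invented for this example, not taken from a trained model); relatedness between two words is then estimated with the cosine of their vectors:

```python
import math

# Toy 4-dimensional embeddings (hypothetical values for illustration only;
# real models use hundreds of dimensions learned from a corpus).
embeddings = {
    "well":      [0.9, 0.1, 0.3, 0.0],
    "borehole":  [0.8, 0.2, 0.4, 0.1],
    "sandstone": [0.1, 0.9, 0.0, 0.7],
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Related terms score higher than unrelated ones.
print(cosine(embeddings["well"], embeddings["borehole"]))   # high (~0.98)
print(cosine(embeddings["well"], embeddings["sandstone"]))  # low  (~0.16)
```

The same cosine measure underlies both the training objective's notion of context similarity and the evaluations discussed later in the article.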
Several general-domain embeddings for different languages are available for public use (Pennington et al., 2014; Bojanowski et al., 2017; Fares et al., 2017), including a few models for Portuguese (Rodrigues et al., 2016; Hartmann et al., 2017). However, the highly technical vocabulary of the O&G domain poses a challenge to NLP applications, since some words may assume a completely different meaning from their conventional interpretation (Nooralahzadeh et al., 2018; Cordeiro et al., 2019). For example, a “christmas tree” is an assembly of valves that provides flow control on an oil or gas well – a vector representation drawn from general-domain corpora would hardly capture the intended meaning. Furthermore, there is consistent evidence that specialized word embedding models induced from a domain-specific corpus can significantly improve the quality of the semantic representation (Lai et al., 2016; Pakhomov et al., 2016; Nooralahzadeh et al., 2018; Wang et al., 2018a; Alsentzer et al., 2019; Tshitoyan et al., 2019).
Portuguese is among the languages with the largest numbers of native speakers. Moreover, recent auctions of Brazilian pre-salt exploration blocks and improvements in regulatory frameworks have increased the attractiveness of exploration and production projects to international investors (Clavijo et al., 2019). Yet, despite the importance of the petroleum sector in Portuguese-speaking countries, specialized public corpora in this domain are scarce. Furthermore, technical texts in the O&G domain have known differences in linguistic properties and word meanings with respect to general-domain texts, motivating the need for specialized embedding representations for NLP tasks.
To fill this gap, we introduce PetroVec, a set of specialized pre-trained word embedding models for the O&G domain in Portuguese. PetroVec was trained on a large O&G corpus, which we assembled from thousands of documents – periodicals, technical reports, glossaries, academic theses, and articles – published by both academia and major companies. We trained the word embedding models on the specialized corpus using Word2vec (Mikolov et al., 2013a) and FastText (Bojanowski et al., 2017), exploring variations of corpus composition. Since resources for evaluating word embedding models in this domain and language are lacking, we created a test set containing semantic relatedness scores for 1500 word pairs, labeled by experts in geosciences and petroleum engineering from both academia and industry. This allowed us to perform intrinsic evaluations, measuring how well the embeddings encode semantic properties of the corpus. We also performed extrinsic evaluations on a downstream task of named entity recognition in geoscience, plus a set of qualitative analyses, so the models were thoroughly evaluated both quantitatively and qualitatively. Furthermore, we conducted a comprehensive comparison between our models and a pre-trained general-domain model in Portuguese (Hartmann et al., 2017). Our findings confirm that our domain-specific models capture semantics in a way that is closer to the judgment of domain experts, with all evaluation alternatives pointing to the same conclusions.
Finally, all the resources developed in this work are available for public use in our repository, including the pre-trained word embeddings, corpora, test sets, and scripts for preprocessing, training, and evaluating the models. The main contributions of this work are: (i) a representative set of domain-specific word embedding models for the O&G industry in Portuguese; (ii) the largest corpus ever reported for this domain and language; and (iii) the first annotated test set for intrinsic semantic evaluation in the O&G domain in Portuguese. We believe that many researchers working on Portuguese O&G-related projects, in both industry and academia, can benefit from these resources.
The remainder of this article is organized as follows: Section 2 introduces the background concepts. Section 3 surveys the related work in domain-specific word embeddings. In Section 4, we describe the corpus assembly and the training of the embeddings. Then, the next sections report on the different evaluations we performed. Section 5 presents the intrinsic evaluation. The extrinsic evaluation is detailed in Section 6. Section 7 presents the qualitative evaluation. Finally, Section 8 concludes the article.
Section snippets
Background
Natural language processing (NLP) encompasses a set of computational techniques that aim to give algorithms the ability to automatically analyze text written in human language: resolving syntactic structure, disambiguating words, and comprehending the semantic scope of a sentence (Manning and Schütze, 1999). Such algorithms have been successfully applied to many downstream tasks, both in academia and industry, such as automatic machine translation (Vaswani et al., 2017, …)
Related work
Since the popularization of word embedding models in NLP applications, especially after several promising results with deep learning algorithms (Young et al., 2018), there has been an effort to provide good-quality pre-trained representations for general purposes. Transfer learning techniques are commonly applied to reuse models originally trained on a general-domain corpus, feeding domain-specific algorithms with those pre-trained embeddings to perform a specific task (Ruder et al., 2019) …
Corpora and language models
Considering the lack of reference corpora, we first gathered a large collection of public documents in the O&G domain in Portuguese. The collection includes scientific and technical publications retrieved from major universities and leading institutions in this field, such as Petrobras (a Brazilian multinational corporation in the petroleum industry) and the Brazilian National Agency of Petroleum, Natural Gas and Biofuels (ANP) (a federal government agency responsible for the regulation of these industries) …
Intrinsic evaluation
The intrinsic evaluation aims to measure how well the embeddings encode the semantic and syntactic properties of the text. The process consists of using the models to rate the similarity of pairs of words and comparing these ratings to the human perception of similarity (Baroni et al., 2014; Schnabel et al., 2015; Gladkova and Drozd, 2016). After a thorough search, we found no evaluation datasets in the O&G domain for Portuguese. Thus, in order to create a dataset for intrinsic evaluation …
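Such an evaluation typically reports the Spearman rank correlation between model similarities and human scores. The sketch below is self-contained (no tie handling); the four word-pair scores are invented for illustration, not taken from our 1500-pair test set:

```python
import math

# Hypothetical expert relatedness scores for four word pairs (scale 0-10)
# and the model's cosine similarities for the same pairs.
human_scores = [9.2, 7.5, 3.1, 1.0]
model_scores = [0.91, 0.80, 0.35, 0.22]

def ranks(values):
    """Rank positions (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Perfectly concordant rankings give rho = 1.0.
print(spearman(human_scores, model_scores))  # 1.0
```

A model whose similarity ordering matches the experts' ordering scores close to 1; disagreement pushes the correlation toward 0 or below.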
Extrinsic evaluation
Extrinsic evaluations measure the contribution of a word embedding model when used as input for specific NLP tasks (Turian et al., 2010; Schnabel et al., 2015), such as automatic text classification, named entity recognition (NER), or part-of-speech tagging. In this work, we perform an evaluation on the task of NER in geoscience-related literature. In the next sections, we report on the methodology and the results.
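One common way to feed embeddings into a token-level task such as NER is to build each token's input features from the embeddings of a small context window. The sketch below uses invented 2-dimensional vectors and a window of one neighbor on each side; it illustrates the general mechanism, not the exact setup of our experiments:

```python
# Toy 2-d embeddings; out-of-vocabulary tokens map to a zero vector.
emb = {
    "poço": [0.9, 0.1],
    "de":   [0.1, 0.2],
    "gás":  [0.8, 0.3],
}
DIM = 2
PAD = [0.0] * DIM  # used for positions outside the sentence

def window_features(tokens, i):
    """Concatenate embeddings of the previous, current, and next token."""
    def vec(j):
        if j < 0 or j >= len(tokens):
            return PAD
        return emb.get(tokens[j], PAD)
    return vec(i - 1) + vec(i) + vec(i + 1)

sentence = ["poço", "de", "gás"]
features = [window_features(sentence, i) for i in range(len(sentence))]
print(features[0])  # [0.0, 0.0, 0.9, 0.1, 0.1, 0.2]
```

These per-token feature vectors are then handed to a sequence classifier, which predicts an entity label for each token; better embeddings yield more separable features and, in turn, better NER scores.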
Qualitative evaluation
In addition to the aforementioned intrinsic and extrinsic evaluations, we conducted qualitative analyses of semantic relatedness for sets of terms representing the O&G technical vocabulary. These evaluations include word analogies, semantic space coherence, and categorization (Turian et al., 2010; Schnabel et al., 2015).
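A word-analogy query of the kind used in these analyses can be sketched with the vector-offset (3CosAdd) method. The four toy vectors below are hand-crafted so the classic example works; they are not taken from any trained model:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hand-crafted vectors with a consistent "gender" offset (illustration only).
vecs = {
    "man":    [1.0, 0.0, 0.0],
    "woman":  [1.0, 1.0, 0.0],
    "king":   [1.0, 0.0, 1.0],
    "queen":  [1.0, 1.0, 1.0],
    "prince": [1.0, 0.1, 1.0],  # distractor candidate
}

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by ranking cos(w, b - a + c)."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

print(analogy("man", "woman", "king"))  # queen
```

In the domain-specific setting, the same offset query is applied to O&G term quadruples, and a model is judged by whether the expected fourth term ranks first.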
Conclusion
In this work, we introduced PetroVec, a set of domain-specific pre-trained word embedding models for the O&G industry in Portuguese. These embeddings were induced from a large collection of textual resources gathered from leading institutions in the domain. We also created an annotated test set, labeled by experts in geosciences and petroleum engineering, designed for intrinsic semantic evaluation. The generated models were thoroughly evaluated, both quantitatively (with intrinsic and extrinsic evaluations) and qualitatively …
Conflict of interest
The authors declare no conflict of interest.
Acknowledgments
This work has been partially funded by CENPES Petrobras, CNPq-Brazil, and Capes Finance Code 001.
References (86)
- Process alarm prediction using deep learning and word embedding methods. ISA Transactions (2019).
- Text analytics in industry: challenges, desiderata and trends. Comp. Ind. (2016).
- SECNLP: a survey of embeddings in clinical natural language processing. Journal of Biomedical Informatics (2020).
- Oil and gas 4.0 era: a systematic review and outlook. Comp. Ind. (2019).
- Evolving neural conditional random fields for drilling report classification. J. Petrol. Sci. Eng. (2020).
- A comparison of word embeddings for the biomedical natural language processing. Journal of Biomedical Informatics (2018).
- Information extraction and knowledge graph construction from geoscience literature. Comput. Geosci. (2018).
- A study on similarity and relatedness using distributional and wordnet-based approaches. Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado, USA (2009).
- Text summarization techniques: a brief survey. Int. J. Adv. Comput. Sci. Appl. (2017).
- Publicly available clinical BERT embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop (2019).
- Reconhecimento de Entidades Nomeadas na Área da Geologia: Bacias Sedimentares Brasileiras. Ph.D. Thesis.
- Contextual embeddings: when are they worth it? Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Semantic search on text and knowledge bases. Found. Trends Inf. Ret.
- A neural probabilistic language model. J. Mach. Learn. Res.
- Improving the quality and efficiency of operational planning and risk management with ML and NLP. SPE Offshore Europe Conference and Exhibition.
- Machine learning systems open up access to large volumes of valuable information lying dormant in unstructured documents. Lead. Edge.
- Enriching word vectors with subword information. TACL.
- Multimodal distributional semantics. J. Artif. Intell. Res.
- From word to sense embeddings: a survey on vector representations of meaning. J. Artif. Intell. Res.
- Machine Learning and Natural Language Processing for Automated Analysis of Drilling and Completion Data.
- Impacts of the review of the Brazilian local content policy on the attractiveness of oil and gas projects. J. World Energy Law Bus.
- A unified architecture for natural language processing: deep neural networks with multitask learning. Proceedings of the 25th International Conference on Machine Learning, ICML'08.
- Discovering patterns within the drilling reports using artificial intelligence for operation monitoring. Offshore Technology Conference Brasil, Rio de Janeiro, Brazil.
- Multiword expression processing: a survey. Comput. Linguist.
- Technology intelligence analysis based on document embedding techniques for oil and gas domain. Offshore Technology Conference Brasil, Rio de Janeiro, Brazil.
- Automatic summarization of technical documents in the oil and gas industry. 2019 8th Brazilian Conference on Intelligent Systems (BRACIS).
- BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
- Word vectors, reuse, and replicability: towards a community repository of large-text resources. Proceedings of the 21st Nordic Conference on Computational Linguistics.
- Retrofitting word vectors to semantic lexicons. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Problems with evaluation of word embeddings using word similarity tasks. Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP.
- Intrinsic evaluations of word embeddings: what can we do better? Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP.
- A primer on neural network models for natural language processing. J. Artif. Intell. Res.
- Word Embeddings in Portuguese for the Specific Domain of Oil and Gas. Proceedings of the Rio Oil & Gas Expo and Conference 2018.
- Deep Learning. Adaptive Computation and Machine Learning.
- Distributional structure. WORD.
- Portuguese word embeddings: evaluating on word analogies and natural language tasks. Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, Uberlândia, Brazil.
- Advances in natural language processing. Science.
- Universal language model fine-tuning for text classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- The oil and gas chat bots are coming. J. Pet. Technol.
- Training word embeddings for deep learning in biomedical text mining tasks. 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
- Industry specific word embedding and its application in log classification. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM'19.
- Text classification algorithms: a survey. Inf.
Cited by (16)
- A Zipf's law-based text generation approach for addressing imbalance in entity extraction. Journal of Informetrics (2023).
- A new approach of integrating industry prior knowledge for HAZOP interaction. Journal of Loss Prevention in the Process Industries (2023).
- Geoscience language models and their intrinsic evaluation. Applied Computing and Geosciences (2022). Citation excerpt: "A large number of domain-specific language models have been developed with improved understanding of the semantic information in their field of expertise, therefore leading to better performances on the domain-specific tasks, including BioBERT (Lee et al., 2019), E-BERT (Zhang et al., 2021), PatentBERT (Lee and Hsiang, 2019), SciBERT (Beltagy et al., 2019), and TweetBERT (Qudar and Mago, 2020). In contrast, language models that are specific to geoscience are rare, with the exception of some recent NLP downstream applications in translation (Gomes et al., 2021), keyword generation (Qiu et al., 2019a), information retrieval (Qiu et al., 2018), document search (Holden et al., 2019), and other forms of text mining (Enkhsaikhan et al., 2021a, 2021b; Ma et al., 2020; Peters et al., 2018; Wang et al., 2018). The results from this NLP research provide new tools for extracting geoscience knowledge from unstructured text, but tend to focus on evaluating language model performance on specific downstream tasks."
- Applications of Natural Language Processing to Geoscience Text Data and Prospectivity Modeling. Natural Resources Research (2023).
- Evaluating and mitigating the impact of OCR errors on information retrieval. International Journal on Digital Libraries (2023).