Computers in Industry

Volume 124, January 2021, 103347

Portuguese word embeddings for the oil and gas industry: Development and evaluation

https://doi.org/10.1016/j.compind.2020.103347

Highlights

  • PetroVec presents domain-specific word embeddings for the O&G industry in Portuguese.

  • An O&G corpus in Portuguese and a test set for intrinsic evaluation are constructed.

  • Models were trained with Word2vec and FastText on different corpus compositions.

  • Thorough evaluations show domain-specific models outperform a baseline generic model.

  • Qualitative evaluations show the model can encode a highly coherent semantic space.

Abstract

Over the last decades, oil and gas companies have been facing a continuous increase of data collected in unstructured textual format. New disruptive technologies, such as natural language processing and machine learning, present an unprecedented opportunity to extract a wealth of valuable information from these documents. Word embedding models are among the most fundamental building blocks of natural language processing: by providing meaningful representations of words that capture syntactic and semantic features from context, they enable machine learning algorithms to achieve strong generalization. However, the domain-specific vocabulary of oil and gas poses a challenge to those algorithms, since words may assume a completely different meaning from their common understanding. The Brazilian pre-salt is an important exploratory frontier for the oil and gas industry, with increasing attractiveness for international investments in exploration and production projects, and most of its documentation is in Portuguese. Moreover, Portuguese is one of the most widely spoken languages by number of native speakers. Nonetheless, despite the importance of the petroleum sector in Portuguese-speaking countries, specialized public corpora in this domain are scarce. This work proposes PetroVec, a representative set of word embedding models for the specific domain of oil and gas in Portuguese. We gathered an extensive collection of domain-related documents from leading institutions to build a large specialized oil and gas corpus in Portuguese, comprising more than 85 million tokens. To provide an intrinsic evaluation, assessing how well the models can encode domain semantics from the text, we created a semantic relatedness test set comprising 1,500 word pairs labeled by selected experts in geoscience and petroleum engineering from both academia and industry.
In addition, we performed an extrinsic quantitative evaluation on a downstream task of named entity recognition in geoscience, plus a set of qualitative analyses, and conducted a comparative evaluation against a public general-domain embedding model. The obtained results suggest that our domain-specific models outperformed the general model on their ability to represent specialized terminology. To the best of our knowledge, this is the first attempt to generate and evaluate word embedding models for the oil and gas domain in Portuguese. Finally, all the resources developed by this work are made available for public use, including the pre-trained specialized models, corpora, and validation datasets.

Introduction

Over the last decades, companies have gathered huge amounts of data stored in unstructured textual format. Considerable potentially valuable information may be hidden within these increasing volumes of documents such as scientific articles, journals, technical reports, operation logs and laboratory analysis. Having all those textual resources correctly identified and processed is crucial to support a wide range of industrial and academic applications (Ittoo et al., 2016). Considering the intense competitiveness in this industrial environment, it has been economically vital for oil and gas (O&G) companies to fully leverage information from their existing data sources, in order to accelerate the pursuit for maximizing their operational efficiency ([Blinston and Blondelle, 2017], [Lu et al, 2019]).

Some recent advances in natural language processing (NLP) with deep learning algorithms (LeCun et al., 2015, [Goodfellow et al, 2016]) were successfully applied by several industrial applications, providing efficiency improvements in their decision-making processes ([Ittoo et al, 2016], [Blinston and Blondelle, 2017], [Young et al, 2018], [Nooralahzadeh et al, 2018], [Cordeiro et al, 2019]). Those algorithms take unstructured text as their basic input, so it is important to obtain suitable mathematical representations for the textual elements. Word embedding models have been efficiently used to provide such meaningful representations: unsupervised learning methods are applied to a text corpus to assign a dense n-dimensional vector to each word in a vocabulary. These models can encode semantic and syntactic similarities between words based on the context where they occur (Mikolov et al., 2013a, 2013b; Hartmann et al., 2017). These word vector representations are one of the most fundamental units in any NLP application, since they allow machine learning algorithms to achieve better accuracy due to their great generalization capability (Goldberg, 2016).
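To make the idea concrete, semantic similarity between two embedding vectors is conventionally measured as cosine similarity. The sketch below uses hypothetical 4-dimensional vectors (real models use hundreds of dimensions); the words and values are illustrative only, not taken from PetroVec.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings for three Portuguese words
embeddings = {
    "poço":       [0.9, 0.1, 0.3, 0.0],  # "well"
    "perfuração": [0.8, 0.2, 0.4, 0.1],  # "drilling"
    "banana":     [0.0, 0.9, 0.0, 0.8],  # unrelated distractor
}

sim_related = cosine(embeddings["poço"], embeddings["perfuração"])
sim_unrelated = cosine(embeddings["poço"], embeddings["banana"])
assert sim_related > sim_unrelated  # domain terms sit closer in the space
```

A well-trained model places semantically related domain terms near each other under this metric, which is exactly the property the intrinsic evaluation later quantifies.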

Several general-domain embeddings for different languages are available for public use ([Pennington et al, 2014], [Bojanowski et al, 2017], [Fares et al, 2017]), including a few models for Portuguese ([Rodrigues et al, 2016], [Hartmann et al, 2017]). However, the highly technical vocabulary of the O&G domain presents a challenge to NLP applications, since some words may assume a completely different meaning from their conventional interpretation ([Nooralahzadeh et al, 2018], [Cordeiro et al, 2019]). For example, a “christmas tree” is an assembly of valves that provides flow control on an oil or gas well – a vector representation drawn from general-domain corpora would hardly capture the intended meaning. Furthermore, there is consistent evidence that developing specialized word embedding models induced from a domain-specific corpus can significantly improve the quality of their semantic representation ([Lai et al, 2016], Pakhomov et al., 2016, [Nooralahzadeh et al, 2018], Wang et al., 2018a, [Alsentzer et al, 2019], Tshitoyan et al., 2019).

Portuguese is one of the languages with the largest number of native speakers. Moreover, recent auction offers for Brazilian pre-salt exploration blocks and improvements in regulatory frameworks have increased the attractiveness for international investments in exploration and production projects (Clavijo et al., 2019). However, despite the importance of the petroleum sector in Portuguese-speaking countries, specialized public corpora in this domain are scarce. Furthermore, technical texts in the O&G domain have known differences in linguistic properties, with word meanings that differ from general-domain texts, motivating the need for specialized embedding representations for NLP tasks.

To fill this gap, we introduce PetroVec, a set of specialized pre-trained word embedding models for the O&G domain in Portuguese. PetroVec was trained on a large O&G corpus, which we assembled from thousands of documents such as periodicals, technical reports, glossaries, academic theses, and articles, published by both academia and major companies. We trained the word embedding models from the specialized corpus using Word2vec (Mikolov et al., 2013a) and FastText (Bojanowski et al., 2017), exploring some variations of corpora composition. Since there is a lack of resources to evaluate word embedding models on this domain and language, we created a test set containing semantic relatedness scores for 1,500 word pairs, labeled by experts in geosciences and petroleum engineering from both academia and industry. Hence, we were able to perform intrinsic evaluations, measuring how well the embeddings can encode semantic properties of the corpus. Additionally, we performed extrinsic evaluations on a downstream task of named entity recognition in geoscience, plus a set of qualitative analyses. With that, our models were thoroughly evaluated, both quantitatively (with intrinsic and extrinsic evaluations) and qualitatively. Furthermore, we conducted a comprehensive analysis comparing our models and a pre-trained general-domain model in Portuguese (Hartmann et al., 2017). Our findings confirm that our domain-specific models capture semantics in a way that is closer to domain experts, with all evaluation alternatives pointing to the same conclusions.
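Word2vec's skip-gram formulation, one of the two training methods used here, learns from (target, context) word pairs drawn from a sliding window over the corpus. A minimal sketch of how such pairs are extracted (the sentence and window size are illustrative; the paper's actual training hyperparameters are described in Section 4):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every token within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# "The christmas tree controls the flow of the well" (Portuguese)
sentence = "a árvore de natal controla o fluxo do poço".split()
pairs = skipgram_pairs(sentence, window=2)
assert ("árvore", "natal") in pairs  # domain term parts co-occur in the window
```

FastText extends this scheme by also representing each word as a bag of character n-grams, which helps with the morphology-rich technical vocabulary.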

Finally, all the resources developed in this work are available for public use in our repository, including the pre-trained word embeddings, corpora, test sets, and scripts for preprocessing, training and evaluating the models. The main contributions of this work are as follows: (i) a representative set of domain-specific word embedding models for the O&G industry in Portuguese; (ii) the largest corpus ever reported for this domain and language; and (iii) the first annotated test set for intrinsic semantic evaluation for the O&G domain in Portuguese. We believe that many researchers working on O&G-related projects in Portuguese, both from industry and academia, can benefit from these resources.

The remainder of this article is organized as follows: Section 2 introduces the background concepts. Section 3 surveys the related work in domain-specific word embeddings. In Section 4, we describe the corpus assembly and the training of the embeddings. Then, the next sections report on the different evaluations we performed. Section 5 presents the intrinsic evaluation. The extrinsic evaluation is detailed in Section 6. Section 7 presents the qualitative evaluation. Finally, Section 8 concludes the article.


Background

Natural language processing (NLP) encompasses a set of computational techniques that aim to give algorithms the ability to automatically analyze text written in human language, resolving syntactic structure, disambiguating word senses, and comprehending the semantic scope of a sentence (Manning and Schutze, 1999). Such algorithms have been successfully applied to many downstream tasks, both in academia and industry, such as automatic machine translation (Vaswani et al., 2017,

Related work

Since the popularization of word embedding models in NLP applications, especially after several promising results with deep learning algorithms (Young et al., 2018), there has been an effort to provide good-quality pre-trained representations for general purposes. Transfer learning techniques are commonly applied to reuse models originally trained on a general-domain corpus, feeding domain-specific algorithms with those pre-trained embeddings to perform a specific task (Ruder et al., 2019

Corpora and language models

Considering the lack of reference corpora, we first gathered a large collection of public documents in the O&G domain in Portuguese. The collection includes scientific and technical publications retrieved from major universities and leading institutions in this field, such as Petrobras (a Brazilian multinational corporation in the petroleum industry) and the Brazilian National Agency of Petroleum, Natural Gas and Biofuels (ANP) (a federal government agency responsible for the regulation of the

Intrinsic evaluation

The intrinsic evaluation aims to measure how well the embeddings can encode the semantic and syntactic properties of the text. The process consists of using the models to rate the similarity of pairs of words and comparing those ratings to the human perception of similarity ([Baroni et al, 2014], Schnabel et al., 2015, [Gladkova and Drozd, 2016]). After a thorough search, we found no evaluation datasets in the O&G domain for Portuguese. Thus, in order to create a dataset for intrinsic
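The comparison between model similarities and expert ratings is conventionally scored with Spearman rank correlation, which depends only on the relative ordering of the pairs. A minimal sketch over hypothetical scores (the implementation below assumes no tied values; the actual test set has 1,500 pairs):

```python
def spearman(xs, ys):
    """Spearman rank correlation for two lists without ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for pos, idx in enumerate(order):
            r[idx] = pos + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical expert relatedness scores vs. model cosine similarities
expert = [4.8, 3.9, 2.5, 1.2, 0.4]
model  = [0.91, 0.85, 0.40, 0.35, 0.10]
rho = spearman(expert, model)
assert rho == 1.0  # identical rankings give perfect correlation
```

A higher correlation means the model orders word pairs by relatedness the way the domain experts do, even if the absolute similarity values differ.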

Extrinsic evaluation

Extrinsic evaluations measure the contribution of a word embedding model when used as input for specific NLP tasks ([Turian et al, 2010], Schnabel et al., 2015), such as automatic text classification, named entity recognition (NER) or part-of-speech tagging. In this work we perform an evaluation for the task of NER in Geoscience related literature. In the next sections, we report on the methodology and the results.
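NER systems are usually scored at the entity level: a predicted entity counts as correct only when its span and type both match the gold annotation, and precision, recall, and F1 are computed over those matches. A minimal sketch with hypothetical BIO tags (the entity types shown are illustrative, not the paper's actual label set):

```python
def spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    out, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                out.append((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return set(out)

def f1(gold_tags, pred_tags):
    """Entity-level F1 between gold and predicted BIO sequences."""
    g, p = spans(gold_tags), spans(pred_tags)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["B-BACIA", "I-BACIA", "O", "B-ROCHA"]  # hypothetical geoscience tags
pred = ["B-BACIA", "I-BACIA", "O", "O"]        # one of two entities found
assert abs(f1(gold, pred) - 2 / 3) < 1e-12
```

In the extrinsic setting, the same tagger is trained with each embedding model as input, and the resulting F1 scores indicate how much each model contributes to the task.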

Qualitative evaluation

In addition to the aforementioned intrinsic and extrinsic evaluations, we conducted some experiments on qualitative analyses of semantic relatedness for sets of terms representing the O&G technical vocabulary. These evaluations include word analogies, semantic space coherence and categorization ([Turian et al, 2010], Schnabel et al., 2015).
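The word-analogy test mentioned above relies on vector arithmetic: for an analogy a : b :: c : ?, the answer is the vocabulary word whose vector is nearest to b − a + c, excluding the three query words. A sketch with hypothetical 2-dimensional toy vectors (real analogies run over the full trained vocabulary):

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def analogy(emb, a, b, c):
    """Return the word w (w not in {a, b, c}) nearest to emb[b] - emb[a] + emb[c]."""
    target = [bb - aa + cc for aa, bb, cc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy vectors encoding "homem : rei :: mulher : rainha" (man : king :: woman : queen)
emb = {
    "homem":  [1.0, 0.0],
    "rei":    [1.0, 1.0],
    "mulher": [0.0, 0.2],
    "rainha": [0.0, 1.0],
    "banana": [1.0, -1.0],  # distractor
}
assert analogy(emb, "homem", "rei", "mulher") == "rainha"
```

The qualitative evaluation checks whether the domain models resolve analogous relations between O&G technical terms in the same fashion.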

Conclusion

In this work, we introduced PetroVec, a set of domain-specific pre-trained word embedding models for the O&G industry in Portuguese. These embeddings were induced from a large collection of textual resources gathered from leading institutions in this domain. We also created an annotated test set, labeled by experts in geosciences and petroleum engineering, designed for intrinsic semantic evaluation. The generated models were thoroughly evaluated, both quantitatively (with intrinsic and

Conflict of interest

The authors declare no conflict of interest.

Declaration of Competing Interest

The authors report no declarations of interest.

Acknowledgments

This work has been partially funded by CENPES Petrobras, CNPq-Brazil, and Capes Finance Code 001.

References (86)

  • D.O.F. Amaral, Reconhecimento de Entidades Nomeadas na Área da Geologia: Bacias Sedimentares Brasileiras, Ph.D. Thesis (2017)
  • S. Arora et al., Contextual embeddings: when are they worth it?, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020)
  • M. Baroni et al., Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2014)
  • H. Bast et al., Semantic search on text and knowledge bases, Found. Trends Inf. Retr. (2016)
  • Y. Bengio et al., A neural probabilistic language model, J. Mach. Learn. Res. (2003)
  • C.E. Birnie et al., Improving the quality and efficiency of operational planning and risk management with ML and NLP, SPE Offshore Europe Conference and Exhibition (2019)
  • K. Blinston et al., Machine learning systems open up access to large volumes of valuable information lying dormant in unstructured documents, Lead. Edge (2017)
  • P. Bojanowski et al., Enriching word vectors with subword information, TACL (2017)
  • E. Bruni et al., Multimodal distributional semantics, J. Artif. Intell. Res. (2014)
  • J. Camacho-Collados et al., From word to sense embeddings: a survey on vector representations of meaning, J. Artif. Intell. Res. (2018)
  • D. Castiñeira et al., Machine Learning and Natural Language Processing for Automated Analysis of Drilling and Completion Data (2018)
  • W. Clavijo et al., Impacts of the review of the Brazilian local content policy on the attractiveness of oil and gas projects, J. World Energy Law Bus. (2019)
  • R. Collobert et al., A unified architecture for natural language processing: deep neural networks with multitask learning, Proceedings of the 25th International Conference on Machine Learning, ICML'08 (2008)
  • D. Colombo et al., Discovering patterns within the drilling reports using artificial intelligence for operation monitoring, Offshore Technology Conference Brasil, Rio de Janeiro, Brazil (2019)
  • M. Constant et al., Multiword expression processing: a survey, Comput. Linguist. (2017)
  • F.C. Cordeiro et al., Technology intelligence analysis based on document embedding techniques for oil and gas domain, Offshore Technology Conference Brasil, Rio de Janeiro, Brazil (2019)
  • J.M. Correia Marques et al., Automatic summarization of technical documents in the oil and gas industry, 2019 8th Brazilian Conference on Intelligent Systems (BRACIS) (2019)
  • J. Devlin et al., BERT: pre-training of deep bidirectional transformers for language understanding, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (2019)
  • M. Fares et al., Word vectors, reuse, and replicability: towards a community repository of large-text resources, Proceedings of the 21st Nordic Conference on Computational Linguistics (2017)
  • M. Faruqui et al., Retrofitting word vectors to semantic lexicons, Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2015)
  • M. Faruqui et al., Problems with evaluation of word embeddings using word similarity tasks, Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP (2016)
  • A. Gladkova et al., Intrinsic evaluations of word embeddings: what can we do better?, Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP (2016)
  • Y. Goldberg, A primer on neural network models for natural language processing, J. Artif. Intell. Res. (2016)
  • D. Gomes et al., Word embeddings in Portuguese for the specific domain of oil and gas, Proceedings of the Rio Oil & Gas Expo and Conference 2018 (2018)
  • I. Goodfellow et al., Deep Learning, Adaptive Computation and Machine Learning (2016)
  • Z.S. Harris, Distributional structure, WORD (1954)
  • N. Hartmann et al., Portuguese word embeddings: evaluating on word analogies and natural language tasks, Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, Uberlândia, Brazil (2017)
  • J. Hirschberg et al., Advances in natural language processing, Science (2015)
  • J. Howard et al., Universal language model fine-tuning for text classification, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2018)
  • T. Jacobs, The oil and gas chat bots are coming, J. Pet. Technol. (2019)
  • Z. Jiang et al., Training word embeddings for deep learning in biomedical text mining tasks, 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2015)
  • E. Khabiri et al., Industry specific word embedding and its application in log classification, Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM'19 (2019)
  • K. Kowsari et al., Text classification algorithms: a survey, Information (2019)