Portuguese word embeddings for the oil and gas industry: Development and evaluation
Computers in Industry (IF 10.0), Pub Date: 2020-11-26, DOI: 10.1016/j.compind.2020.103347
Diogo da Silva Magalhães Gomes, Fábio Corrêa Cordeiro, Bernardo Scapini Consoli, Nikolas Lacerda Santos, Viviane Pereira Moreira, Renata Vieira, Silvia Moraes, Alexandre Gonçalves Evsukoff

Over the last decades, oil and gas companies have faced a continuous increase in data collected as unstructured text. New disruptive technologies, such as natural language processing and machine learning, present an unprecedented opportunity to extract a wealth of valuable information from these documents. Word embedding models are among the most fundamental building blocks of natural language processing: they give machine learning algorithms strong generalization capabilities by providing meaningful representations of words that capture syntactic and semantic features from context. However, the domain-specific vocabulary of oil and gas, in which words may take on a meaning completely different from their common usage, is a challenge for those algorithms. The Brazilian pre-salt is an important exploratory frontier for the oil and gas industry, increasingly attractive to international investment in exploration and production projects, and most of its documentation is in Portuguese. Moreover, Portuguese is one of the most widely spoken languages by number of native speakers. Nonetheless, despite the importance of the petroleum sector in Portuguese-speaking countries, specialized public corpora in this domain are scarce. This work proposes PetroVec, a representative set of word embedding models for the specific domain of oil and gas in Portuguese. We gathered an extensive collection of domain-related documents from leading institutions to build a large specialized oil and gas corpus in Portuguese, comprising more than 85 million tokens. To provide an intrinsic evaluation, assessing how well the models encode domain semantics from the text, we created a semantic relatedness test set comprising 1,500 word pairs labeled by selected experts in geoscience and petroleum engineering from both academia and industry.
In addition, we performed an extrinsic quantitative evaluation on a downstream task of named entity recognition in geoscience, plus a set of qualitative analyses, and conducted a comparative evaluation against a public general-domain embedding model. The results suggest that our domain-specific models outperform the general model in their ability to represent specialized terminology. To the best of our knowledge, this is the first attempt to generate and evaluate word embedding models for the oil and gas domain in Portuguese. Finally, all the resources developed in this work are made available for public use, including the pre-trained specialized models, corpora, and validation datasets.
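
As an illustration of the intrinsic evaluation described above, the following minimal sketch scores an embedding model against expert-labeled word pairs by comparing cosine similarities with the human ratings. It assumes a rank-correlation (Spearman) metric, which is a common choice for relatedness test sets; the abstract does not state which metric the authors used. The function names, the toy vectors, and the toy ratings are all hypothetical and are not taken from PetroVec or its test set.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def relatedness_score(embeddings, labeled_pairs):
    """Correlate model similarities with human ratings.

    Pairs containing a word missing from the vocabulary are skipped.
    """
    model_scores, human_scores = [], []
    for word_a, word_b, rating in labeled_pairs:
        if word_a in embeddings and word_b in embeddings:
            model_scores.append(cosine(embeddings[word_a], embeddings[word_b]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)


# Hypothetical toy data: 3-dimensional vectors and made-up expert ratings.
toy_embeddings = {
    "poço": [0.9, 0.1, 0.0],
    "reservatório": [0.8, 0.3, 0.1],
    "broca": [0.5, 0.8, 0.1],
    "sal": [0.0, 0.2, 0.9],
}
toy_pairs = [
    ("poço", "reservatório", 4.5),
    ("poço", "broca", 3.0),
    ("poço", "sal", 1.0),
    ("poço", "plataforma", 4.0),  # out of vocabulary, skipped
]

score = relatedness_score(toy_embeddings, toy_pairs)
print(f"Spearman correlation on toy pairs: {score:.3f}")
```

A higher correlation means the model's similarity ranking agrees more closely with the experts' ranking, which is the sense in which a domain-specific model can be compared against a general-domain one on the same pairs.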




Updated: 2020-11-27