Abstract
A large amount of georeferenced quantitative data from rock and geoscience surveys is buried in geological documents and remains unused. Data analytics and information extraction offer opportunities to use this data to improve understanding of ore-forming processes. Extracting spatiotemporal and semantic information from a set of geological documents makes it possible to build a rich representation of the geoscience knowledge recorded in unstructured Chinese text. This paper presents a workflow for spatiotemporal and semantic information extraction: a geological document analysis approach that uses automated techniques for browsing and searching relevant geological content. The workflow applies spatial and temporal gazetteer matching, pattern-based rules, and spatiotemporal relationship extraction to identify and label terms in geological text documents. It represents contextual information as a knowledge graph, extracts relevant tables and figures, and retrieves relevant documents by geological topic. Text mining techniques are used to facilitate the analysis of geological knowledge and to show how text analysis can speed up the assessment of a large number of documents. In addition, automatically generated keyword suggestions, derived from extracted keyword associations, reduce document search effort. The results illustrate the usefulness and effectiveness of the developed information extraction workflow and demonstrate the potential of applying text mining and natural language processing (NLP) techniques in geoscience.
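The labeling step described in the abstract (gazetteer matching combined with pattern-based rules) might look roughly like the following minimal sketch. It is not the authors' implementation: the gazetteer entries, the age pattern, the labels `LOC`, `TIME`, and `AGE`, and all function names are hypothetical, and English text is used only for readability (the paper targets Chinese text, where plain substring matching of this kind would apply in the same way).

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    text: str
    label: str


# Hypothetical toy gazetteers; a real workflow would load much larger lists.
SPATIAL_GAZETTEER = {"Hubei Province", "Wuhan", "Yangtze River"}
TEMPORAL_GAZETTEER = {"Jurassic", "Late Triassic", "Triassic"}

# Hypothetical pattern rule: numeric ages such as "215 Ma" or "2.5 Ga".
AGE_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?(?:Ma|Ga)\b")


def match_gazetteer(text, gazetteer, label):
    """Return a Span for every literal occurrence of a gazetteer term."""
    spans = []
    for term in gazetteer:
        for m in re.finditer(re.escape(term), text):
            spans.append(Span(m.start(), m.end(), m.group(), label))
    return spans


def match_pattern(text, pattern, label):
    """Return a Span for every match of a compiled regular expression."""
    return [Span(m.start(), m.end(), m.group(), label)
            for m in pattern.finditer(text)]


def resolve_overlaps(spans):
    """Keep the longest span wherever candidates overlap."""
    chosen = []
    for s in sorted(spans, key=lambda s: (-(s.end - s.start), s.start)):
        if all(s.end <= c.start or s.start >= c.end for c in chosen):
            chosen.append(s)
    return sorted(chosen)  # ordered by start offset


def label_text(text):
    """Label spatial terms, temporal terms, and numeric ages in text."""
    spans = (match_gazetteer(text, SPATIAL_GAZETTEER, "LOC")
             + match_gazetteer(text, TEMPORAL_GAZETTEER, "TIME")
             + match_pattern(text, AGE_PATTERN, "AGE"))
    return resolve_overlaps(spans)


if __name__ == "__main__":
    sentence = "Late Triassic granite near Wuhan was dated at 215 Ma."
    for span in label_text(sentence):
        print(span.label, span.text)
```

Preferring the longest match keeps a nested term such as "Triassic" from being labeled separately inside "Late Triassic"; other tie-breaking policies are equally plausible.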
Acknowledgments
We would like to thank the anonymous reviewers for carefully reading this paper and for their very useful comments. This study was financially supported by the National Natural Science Foundation of China (U1711267, 41671400, 41871311, 41871305) and the National Key Research and Development Program (2018YFB0505500, 2018YFB0505504).
Author information
Contributions
Conceived and designed the experiments: Qinjun Qiu, Liufeng Tao and Zhong Xie; Performed the experiments: Qinjun Qiu, Liufeng Tao, and Zhong Xie; Analyzed the data: Qinjun Qiu, Liufeng Tao, and Zhong Xie; Wrote the paper: Qinjun Qiu, Liang Wu, Zhong Xie and Liufeng Tao.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Communicated by: H. Babaie
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Qiu, Q., Xie, Z., Wu, L. et al. Automatic spatiotemporal and semantic information extraction from unstructured geoscience reports using text mining techniques. Earth Sci Inform 13, 1393–1410 (2020). https://doi.org/10.1007/s12145-020-00527-9