Automatic spatiotemporal and semantic information extraction from unstructured geoscience reports using text mining techniques

  • Research Article
  • Published in: Earth Science Informatics

Abstract

A large volume of georeferenced quantitative data about rocks and geoscience surveys is buried in geological documents and remains unused. Data analytics and information extraction offer opportunities to use these data to improve our understanding of ore-forming processes. Extracting spatiotemporal and semantic information from a set of geological documents enables us to build a rich representation of the geoscience knowledge recorded in unstructured text written in Chinese. This paper presents a workflow for spatiotemporal and semantic information extraction: a geological document analysis approach that uses automated techniques for browsing and searching relevant geological content. The workflow applies spatial and temporal gazetteer matching, pattern-based rules, and spatiotemporal relationship extraction to identify and label terms in geological text documents. It represents contextual information as a knowledge graph, extracts a set of relevant tables and figures, and retrieves a list of relevant documents by using geological topic information. Text mining techniques are used to facilitate the analysis of geological knowledge and to show the effectiveness of text analysis for the rapid assessment of a massive number of documents. Furthermore, autogenerated keyword suggestions derived from extracted keyword associations are used to reduce document search effort. This research illustrates the usefulness and effectiveness of the developed information extraction workflow and demonstrates the potential of incorporating text mining and natural language processing (NLP) techniques into geoscience.
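The labeling steps named in the abstract (gazetteer matching, pattern-based rules, and linking spatial and temporal terms into graph edges) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the gazetteer entries, the age pattern, the `co_occurs_with` relation name, and all function names are hypothetical, and the example uses English text for readability, whereas the paper works on Chinese reports, which would also need word segmentation. Sentence-level co-occurrence stands in here for the relationship extraction step.

```python
import re
from typing import NamedTuple

# Hypothetical toy gazetteers; a real workflow would load curated
# place-name and geologic-time vocabularies.
SPATIAL_GAZETTEER = {"Lala copper deposit", "Sichuan"}
TEMPORAL_GAZETTEER = {"Jurassic", "Late Triassic", "Triassic"}

# Illustrative pattern rule for numeric ages such as "165 Ma" or "2.5 Ga".
AGE_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?(?:Ma|Ga)\b")


class Mention(NamedTuple):
    start: int
    end: int
    text: str
    label: str


def gazetteer_matches(text, terms, label):
    """Label every gazetteer term found in text, preferring longer terms."""
    found = []
    # Longest terms first, so "Late Triassic" wins over "Triassic".
    for term in sorted(terms, key=len, reverse=True):
        for m in re.finditer(re.escape(term), text):
            overlaps = any(m.start() < f.end and f.start < m.end() for f in found)
            if not overlaps:
                found.append(Mention(m.start(), m.end(), m.group(), label))
    return found


def extract_mentions(text):
    """Combine gazetteer matching with the pattern rule, ordered by position."""
    mentions = gazetteer_matches(text, SPATIAL_GAZETTEER, "SPATIAL")
    mentions += gazetteer_matches(text, TEMPORAL_GAZETTEER, "TEMPORAL")
    mentions += [
        Mention(m.start(), m.end(), m.group(), "TEMPORAL")
        for m in AGE_PATTERN.finditer(text)
    ]
    return sorted(mentions)


def cooccurrence_edges(text):
    """Link spatial and temporal mentions that share a sentence.

    Assumption: co-occurrence is a crude stand-in for relation extraction.
    """
    edges = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        mentions = extract_mentions(sentence)
        places = [m.text for m in mentions if m.label == "SPATIAL"]
        times = [m.text for m in mentions if m.label == "TEMPORAL"]
        for place in places:
            for time in times:
                edges.add((place, "co_occurs_with", time))
    return edges


if __name__ == "__main__":
    sample = (
        "The Lala copper deposit in Sichuan formed in the Late Triassic. "
        "Zircon ages cluster near 165 Ma."
    )
    for mention in extract_mentions(sample):
        print(mention.label, mention.text)
    for edge in sorted(cooccurrence_edges(sample)):
        print(edge)
```

The resulting triples could be loaded into any graph store. Matching longer gazetteer terms first is a simple way to avoid labeling a nested term separately from the phrase that contains it.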



Acknowledgments

We would like to thank the anonymous reviewers for carefully reading this paper and for their very useful comments. This study was financially supported by the National Natural Science Foundation of China (U1711267, 41671400, 41871311, 41871305) and the National Key Research and Development Program (2018YFB0505500, 2018YFB0505504).

Author information

Contributions

Conceived and designed the experiments: Qinjun Qiu, Liufeng Tao and Zhong Xie; Performed the experiments: Qinjun Qiu, Liufeng Tao, and Zhong Xie; Analyzed the data: Qinjun Qiu, Liufeng Tao, and Zhong Xie; Wrote the paper: Qinjun Qiu, Liang Wu, Zhong Xie and Liufeng Tao.

Corresponding author

Correspondence to Liufeng Tao.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Communicated by: H. Babaie

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Qiu, Q., Xie, Z., Wu, L. et al. Automatic spatiotemporal and semantic information extraction from unstructured geoscience reports using text mining techniques. Earth Sci Inform 13, 1393–1410 (2020). https://doi.org/10.1007/s12145-020-00527-9

  • DOI: https://doi.org/10.1007/s12145-020-00527-9
