Methodology for refining subject terms and supporting subject indexing with taxonomy: A case study of the APO digital repository

https://doi.org/10.1016/j.dss.2021.113542

Highlights

  • Propose a methodology for refining existing subject terms by estimating their frequencies and semantics, and for inducing a taxonomy from the refined subject terms by integrating their mutual usages.

  • Provide a thorough analysis of our proposed methodology using the APO (Analysis & Policy Observatory) digital repository to show the applicability of the methodology.

  • Measure the generalisability of the proposed taxonomy-inducing method, in comparison with the state-of-the-art taxonomy-inducing method, TaxoFinder.

Abstract

In digital repositories, it is crucial to refine existing subject terms and to exploit a taxonomy of subject terms in order to support information retrieval tasks such as indexing, cataloging and searching of digital documents. In this paper, we address how to refine an existing set of subject terms used to index digital documents, which often contains irrelevant or noisy terms. Further, we present how to automatically induce a subject term taxonomy to capture and utilise the semantic relations among subject terms. Few related works have studied these problems; most focus on creating subject terms or on building a taxonomy of key terms from text documents. We propose a methodology² for refining an existing set of subject terms in a digital repository by identifying their semantics, and for inducing a taxonomy of subject terms by analysing their mutual usages and maximising their semantic relatedness. We then present a case study using the Analysis & Policy Observatory (APO) digital repository to analyse the proposed methodology and demonstrate its applicability. Further, to validate the generalisability of the proposed taxonomy-inducing method, we evaluate it using a gold-standard taxonomy in the life sciences, Medical Subject Headings (MeSH), in comparison with the state-of-the-art taxonomy-inducing method, TaxoFinder. Our evaluation shows that our methodology has high potential for refining an existing set of subject terms and capturing their semantic relationships by inducing a subject term taxonomy.

Introduction

Digital repositories (or libraries) have emerged as essential information systems that serve as repositories of digital documents and provide search and retrieval mechanisms via user interaction [1]. Advances in information retrieval research have significantly enhanced the access, functionality and technical capabilities of digital repositories. In a digital repository, subject terms are an essential building block for descriptive cataloging and indexing of documents [2]. These terms are usually derived from some type of controlled vocabulary, such as predefined keywords associated with the underlying documents [3]. Thus, subject terms are considered a key asset in a digital repository: they contribute significantly not only to describing the information or knowledge contained in digital documents but also to improving the relevance of search results [4]. As a result, developing and utilising useful subject terms plays a crucial role in determining the quality of a digital repository and in making effective use of its documents.

Creating a taxonomy of subject terms is another key to supporting information retrieval tasks such as indexing, cataloging and searching of information from digital documents [5,6,7]. In digital repositories, we consider three benefits of using a subject term taxonomy. First, the ability to index digital documents can be improved, regardless of the indexing method (i.e., manual, semi-automatic or automatic), by utilising the semantic associations between subject terms induced from the taxonomy. Second, the semantic knowledge captured in the taxonomy gives humans a better understanding of the underlying subject terms, thereby facilitating their refinement (i.e., the improvement or clarification of subject terms). Third, a subject term taxonomy can improve the ability of researchers or end-users to search documents by linking and suggesting related subject terms and by offering a hierarchical structure that makes them easier to navigate.

In this paper, we tackle two challenging problems. The first is to refine an existing set of subject terms in a digital repository that often contains irrelevant ones. We refer to an irrelevant subject term as a subject term that may not be useful for indexing the document collection in the target repository. Typically, such a term is rarely used to index the collection and can cause confusion for indexing in a digital repository. The second is to automatically induce a taxonomy from the refined subject terms to capture their semantic relations. Our research motivation is two-fold. First, expediting the reuse of existing subject terms has been highlighted as a key to maximising their value [8]. In some digital repositories, a controlled vocabulary of subject terms is often readily available. For instance, the Analysis & Policy Observatory (APO), the largest open access repository for public policy and research literature in Australia, already uses a combination of a portion of the subject terms drawn from a general-use controlled vocabulary, Faceted Application of Subject Terminology (FAST), and subject terms that the repository curators have manually identified. However, as highlighted in [3], there often exist many irrelevant subject terms in the underlying repository. Further, identifying relevant subject terms from a general-use controlled vocabulary requires a high level of comprehensiveness of the underlying documents and of the semantic coverage of the terms in such a vocabulary [9]. This task is not straightforward and can be very difficult to achieve. As the second facet of our motivation, inducing a taxonomy from subject terms has been relatively overlooked. This is a challenging problem, as these terms themselves do not contain explicit relationships from which a taxonomy can be constructed.

Most existing approaches have paid little attention to the challenging problems mentioned above. First, these approaches have mainly proposed methods for creating subject terms (or keywords). In contrast, our focus is to automatically refine an existing set of subject terms by analysing their semantics. Second, related works have mostly focused on automatically building a taxonomy of important terms from text documents [5,10]. We instead focus on inducing a subject term taxonomy from refined subject terms by analysing their mutual usages.

This paper makes three main contributions. First, we propose a method for refining an existing set of subject terms S that possibly contains irrelevant ones in a given document collection D (Section 3). Given S, our refinement process takes two steps: (1) for each document d ∈ D, we identify additional subject term candidates that may be relevant but were not previously assigned to d, and add these candidates to S (Section 3.1); and (2) we filter out irrelevant subject terms from S and merge insignificant subject terms with more significant ones based on their similarities, producing a more precise set of subject terms with improved semantic coverage, called refined subject terms S (Section 3.2). Second, we propose an approach for inducing a taxonomy from S by integrating the mutual usages of subject terms for indexing documents with their semantics (Section 4). For this, we apply the subsumption method [11] with our proposed objective function, which maximises the semantic relatedness of subject terms in the induced taxonomy. Third, we present a case study using the APO repository to show the applicability of the proposed methodology (Section 5). Further, to show the generalisability of the proposed taxonomy-inducing approach, we evaluate its effectiveness using MeSH, in comparison with the state-of-the-art taxonomy-inducing method, TaxoFinder [5] (Section 6). Our case study and evaluation show that the proposed methodology has high potential to be used for refining an existing set of subject terms and capturing their semantic relationships by inducing their taxonomy.
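To make the idea of inducing parent-child relations from the "mutual usages" of subject terms more concrete, the following is a minimal sketch of the classic document co-occurrence form of subsumption: a term x is taken to subsume a term y when P(x|y) is at least a threshold and P(y|x) is below 1. The 0.8 threshold, the function name `induce_subsumption_pairs`, and the toy data are assumptions for illustration only. The sketch does not include the objective function proposed in this paper, and it is not taken from the authors' source code.

```python
from collections import defaultdict
from itertools import permutations


def induce_subsumption_pairs(doc_terms, threshold=0.8):
    """Return (parent, child) pairs where parent subsumes child.

    doc_terms maps a document id to the set of subject terms that index it.
    The threshold value is an illustrative assumption.
    """
    # Invert the index: subject term -> set of documents it indexes.
    term_docs = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            term_docs[term].add(doc_id)

    pairs = []
    for x, y in permutations(sorted(term_docs), 2):
        both = len(term_docs[x] & term_docs[y])
        p_x_given_y = both / len(term_docs[y])  # share of y's documents also indexed by x
        p_y_given_x = both / len(term_docs[x])  # share of x's documents also indexed by y
        # x subsumes y: x covers most of y's documents, but y does not cover all of x's.
        if p_x_given_y >= threshold and p_y_given_x < 1.0:
            pairs.append((x, y))
    return pairs


# Hypothetical toy collection: document id -> assigned subject terms.
doc_terms = {
    "d1": {"health", "mental health"},
    "d2": {"health", "mental health"},
    "d3": {"health", "public health"},
    "d4": {"health"},
    "d5": {"education"},
}
pairs = induce_subsumption_pairs(doc_terms)
print(pairs)
```

In this toy collection, "health" indexes every document that the two narrower terms index, so it is proposed as their parent, while "education" remains unconnected. A real taxonomy would additionally need to resolve terms with several candidate parents, which is where a selection criterion such as the paper's objective function would apply.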

This paper is organised as follows. Section 2 provides related works and background of the APO repository. Section 3 presents an overview of the proposed methodology and describes the process for refining subject terms. Section 4 discusses our approach for inducing a subject term taxonomy. Section 5 shows a case study of our proposed methodology using the APO repository. Section 6 presents our evaluation of our approach for inducing a subject term taxonomy to show its generalisability. Section 7 presents the conclusion of this paper.

Related work and background

In this section, we first present how information is organised using subject terms and taxonomies in digital repositories. Then, we review research works that focus on creating or refining subject terms (or keywords) in the information and library science community. Afterwards, we discuss related works on inducing a taxonomy from a knowledge base or subject terms. Finally, we introduce the APO digital repository used in our case study.

Methodology for refining subject terms

In this section, we present the details of our proposed methodology for refining subject terms, followed by the method for inducing a subject term taxonomy in Section 4. An overview of the methodology, which comprises three steps, is depicted in Fig. 1. First, given the text documents D in a repository and the existing set of subject terms S used to index D, we identify the subject terms that are potentially relevant but were not previously assigned to related documents in D (Section 3.1). Our premise

Inducing subject term taxonomy

In Section 3, we discussed the process for refining existing subject terms. The resulting primary subject terms are called refined subject terms S, where each one is associated with its synonyms. In the rest of this section, to simplify our presentation, subject terms are referred to as S.

Although we have generated S, we may still have difficulty understanding and representing the semantic relatedness of terms in S. Specifically, there are three important questions to be

A case study: analysis of the APO repository

We conduct a case study in which we apply the proposed methodology to the APO repository. First, we analyse how Step 1 in Fig. 1 can identify missing subject terms. Second, we qualitatively assess the refined subject terms produced by Step 2 in Fig. 1, based on the judgement of the APO curators. Finally, we present the subject term taxonomy induced from the refined subject terms using Step 3 in Fig. 1. In Section 6, we further present an experiment to evaluate the effectiveness of our

Evaluation of inducing subject term taxonomy

We now quantitatively evaluate the effectiveness of the subsumption method (SS) using a well-known public dataset. This improves our understanding of the generalisability of our approach for inducing a taxonomy. To show the relative effectiveness of SS, we also compare it with TaxoFinder. We assess the induced taxonomy Ti from each method (i.e., SS or TaxoFinder) by comparing it with the existing gold-standard taxonomy Tg. Our aim is to induce a taxonomy Ti that is as close as possible to Tg.
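One simple way to quantify how close an induced taxonomy Ti is to a gold-standard taxonomy Tg is to treat each taxonomy as a set of (parent, child) edges and compute precision, recall and F1 over the shared edges. The sketch below illustrates that idea only. The measure actually used in this paper is not stated in this excerpt, so the edge-overlap metric, the function name `edge_prf`, and the toy edges are assumptions.

```python
def edge_prf(induced_edges, gold_edges):
    """Compare two taxonomies given as collections of (parent, child) edges.

    Returns (precision, recall, f1) of the induced edges against the gold edges.
    """
    induced, gold = set(induced_edges), set(gold_edges)
    correct = len(induced & gold)  # edges present in both taxonomies
    precision = correct / len(induced) if induced else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold-standard taxonomy Tg and induced taxonomy Ti.
t_g = {
    ("health", "mental health"),
    ("health", "public health"),
    ("health", "nutrition"),
}
t_i = {
    ("health", "mental health"),
    ("health", "public health"),
    ("education", "nutrition"),
}
precision, recall, f1 = edge_prf(t_i, t_g)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact edge matching is strict: an induced edge that attaches a term to its grandparent rather than its parent counts as wrong. Measures that credit ancestor-descendant agreement are a common, more lenient alternative.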

Conclusion

In this paper, we presented a methodology for refining an existing set of subject terms used to index a digital collection in a digital repository. The motivation of our work was to (1) additionally find potentially relevant subject terms that have been missed for indexing the collection, and (2) also generate a more precise, meaningful set of subject terms by refining the existing subject terms. Further, we presented the method for inducing a taxonomy from the refined subject terms, with our

CRediT authorship contribution statement

Yong-Bin Kang: Investigation, Conceptualization, Methodology, Software, Formal analysis, Writing - review & editing. Jihoon Woo: Software, Formal analysis, Writing - review & editing. Les Kneebone: Conceptualization, Methodology, Writing - review & editing. Timos Sellis: Project administration, Funding acquisition, Supervision, Investigation, Conceptualization, Writing - review & editing

Acknowledgements

This work was supported by the Linked semantic platforms for social & physical infrastructure & wellbeing project, LE180100094, funded by an Australian Research Council Linkage Infrastructure, Equipment and Facilities (LIEF) grant. We especially thank Associate Professor Brigid van Wanrooy, Director of the Analysis and Policy Observatory (APO), for providing insightful comments and support during the research project.

References (35)

  • R.F. Smallwood, Information Governance: Concepts, Strategies and Best Practices (2019)
  • J. Greenberg, Metadata capital: raising awareness, exploring a new concept, Bull. Assoc. Inf. Sci. Technol. (2014)
  • R. Bennett et al., assignFAST: an autosuggest based tool for FAST subject assignment, Inf. Technol. Libr. (2014)
  • S. Huang et al., Improving taxonomic relation learning via incorporating relation descriptions into word embeddings, Concurr. Comput. (2020)
  • H. Hedden, Taxonomies and controlled vocabularies best practices for metadata, J. Digital Asset Manag. (2010)
  • O. Medelyan, Human-Competitive Automatic Topic Indexing (2009)
  • A. Kühnemund, The role of applications within the reviewing service zbMATH, PAMM (2016)
    Yong-Bin Kang received a PhD in Information Technology from Monash University in 2011. Currently, he is a senior data science research fellow at the ARC Centre of Excellence for Automated Decision Making and Society. His research expertise and interests are mainly in the fields of natural language processing (NLP), machine learning, and data mining. He has experience in working on, managing and delivering large industrial, multi-disciplinary research projects in data science such as patent analytics, clinical data analytics, scientific-article analytics, social-media data analytics, expert finding and matching, and machine learning algorithms and applications. His research has been published in Information Systems, WIREs Data Mining and Knowledge Discovery, JMIR Public Health and Surveillance, IEEE Transactions on Knowledge and Data Engineering, and IEEE Transactions on Cybernetics. Representative A/A* conference venues include AAAI, ECAI, ISWC, CIKM, and K-CAP.

    Jihoon Woo received a bachelor's degree in data science from Swinburne University of Technology. He is currently working at the Social Data Analytics Lab at Swinburne University of Technology as a research assistant. His research interests include machine learning, natural language processing, and knowledge graphs.

    Les Kneebone is an Information Architect at the Analysis and Policy Observatory in Australia. He has worked in information management roles in the government, school, community and research sectors since 2002. He has mainly contributed to managing metadata, taxonomies and cataloguing standards used in these sectors. Before graduating in Information Management at RMIT, he gained post-graduate qualifications in Sociology and the History and Philosophy of Science (Queensland University of Technology and University of Melbourne).

    Timos Sellis is a Research Scientist at Facebook (USA) and an Adjunct Professor at Swinburne University of Technology (Australia), where between 2016 and 2020 he was a Professor and the Director of the Data Science Research Institute. He received his Diploma degree in Electrical Engineering in 1982 from the National Technical University of Athens (NTUA), Greece. He received the M.Sc. degree from Harvard University (USA) in 1983 and the Ph.D. degree from the University of California at Berkeley (USA) in 1986, both in Computer Science. He has previously been a faculty member at the University of Maryland (USA, 1986-92), the National Technical University of Athens (Greece, 1992-2013), and RMIT University (Australia, 2013-2016). In 2018, he was awarded the IEEE TCDE Impact Award, in recognition of his impact in the field and for contributions to database systems research and broadening the reach of data engineering research. His research interests include big data, data streams, personalization, data integration, and spatio-temporal database systems. He is a fellow of the IEEE and ACM.

    1 Work done while at Swinburne University of Technology.

    2 The source code of the proposed methodology is available at https://github.com/Yongbinkang/SubjectTracker
