VoCSK: Verb-oriented commonsense knowledge mining with taxonomy-guided induction☆
Introduction
Commonsense is the inherent background knowledge of humans in the cognitive process [49]. Although current intelligent systems surpass humans in many tasks such as reading comprehension [23] and machine translation [18], intelligent machines still lag behind humans in performing simple tasks. For example, given the sentence "The trophy would not fit in the brown suitcase because it was too big." [26], it is difficult for machines to accurately determine whether "it" in the text refers to "trophy" or "suitcase". In contrast, this problem is easy for humans since they possess a great deal of commonsense knowledge (CSK) and reasoning ability. Unfortunately, modern machines still lack such massive CSK. Thus, it is crucial to endow machines with CSK.
Among various kinds of CSK, verb-oriented CSK is especially important for machines to achieve human-level AI. Verbs, in general, are crucial for the understanding of natural language and thus are widely applicable in NLP tasks such as semantic role labeling [16], word sense disambiguation [11], and query understanding [52]. For example, given a query watch harry potter, the information retrieval (IR) system can understand that harry potter is a movie or a DVD instead of a book through the verb watch. The most fundamental CSK about a specific verb (e.g., eat) is what kind of subjects (e.g., person) will act on what kind of objects (e.g., food). One distinguishing characteristic of verb-oriented CSK is that it is only meaningful when expressed at the concept level. This is because verb-oriented knowledge at the instance level is so specific that it can only be factual knowledge. For example, "John eats bread" is just a trivial fact and cannot be considered CSK since John and bread can be replaced by other specific words, such as Helen and apple. Humans exhibit intelligence not only because we can understand the meaning of trivial facts such as "John eats bread", but also because we can understand that of "person eats food". This means that even given a new person (e.g., Wilbur) and a strange fruit (e.g., plantain), we still understand what "Wilbur eats plantain" means.
In this paper, we focus on the automatic acquisition of implicit verb-oriented CSK, which is the concept-level knowledge of verb phrases (VPs) consisting of a subject, verb, and object. Specifically, our task accepts instance-level VPs (e.g., “John eats bread”, “Helen eats apple”, etc.) as the input and outputs concept-level assertions (e.g., “person eats food”). VPs at the instance level often emphasize the relation between a specific subject and object, while those at the concept level describe the relation between concepts representing the common characteristics of a group of instances.
We argue that VPs at the concept level are rarely covered in existing verb-oriented knowledge bases (KBs). There are two reasons to support this argument. First, the concepts we mine for each verb are neither too abstract nor too specific (see Section 4.2 for details), while the thematic roles (e.g., Agent and Cause) in existing verb-oriented KBs, such as VerbNet [27], FrameNet [2], and PropBank [36], are often too abstract to differentiate verbs' semantics. For example, PropBank defines some general thematic roles that can be applied to any verb. Second, the coverage of the thematic roles for verbs is often limited since it is hard to manually identify a large number of high-quality thematic roles. For example, only 23 pre-defined thematic roles are listed in VerbNet. In contrast, the concepts in the probabilistic taxonomy that we use in this paper number in the millions. Hence, it is necessary to mine VPs at the concept level to complement existing verb-oriented KBs. To more clearly distinguish our mined CSK from the knowledge in existing verb-oriented KBs, some examples from these KBs are given in Table 1.
It is difficult for most existing work to automatically acquire verb-oriented CSK since existing methods are either manual or purely data-driven. Manual methods resort to knowledge engineers or volunteers to obtain CSK by hand [30]. Many traditional commonsense KBs are built in this way, such as WordNet [34], Cyc [25], and ConceptNet [30]. Although hand-crafted CSK is of high quality, its coverage is limited for two reasons. First, manual methods are time-consuming and labor-intensive, hurting the recall of CSK acquisition. Second, CSK obtained by these methods is often incomplete, since only the CSK that people can explicitly think of can be collected. To improve recall, data-driven approaches have been proposed to automatically collect CSK from large corpora. Typical efforts along this line include the discovery of CSK inference rules from text [29], [5], the extraction of concept attributes from query logs [38], the construction of an extensible ontology by linking Wikipedia and WordNet [14], and the mining of specific relations (e.g., the Comparative [47] and Part-Whole [48] relations) from the Web. However, most of these methods focus on acquiring knowledge directly from corpora and thus are limited to extracting explicit CSK, without the ability to harvest implicit knowledge (e.g., "person eats food") that rarely appears in corpora. Furthermore, the coverage of commonsense KBs is upper-bounded by the number of pre-defined relations. For example, there are only about 10 relations in WordNet and 50 relations in ConceptNet 5.7.
Hence, in this paper, we propose an induction-based approach to automatically acquire verb-oriented CSK from VPs with the help of a large-scale probabilistic taxonomy. The core idea is to conceptualize the subjects and objects in VPs with isA relations in a probabilistic taxonomy. For example, "person eats food" can be induced by conceptualizing the VPs "Helen eats apple", "John eats bread", and "Michael eats egg" with isA(Helen/John/Michael, person) and isA(apple/bread/egg, food). There are two reasons why verb-oriented CSK can be obtained through the induction-based approach. On the one hand, great efforts have been devoted to constructing Web-scale probabilistic taxonomies, and most of them (e.g., Probase [54]) are available. On the other hand, with the rapid development of the Web, VPs at the instance level are widely available in Web corpora, e.g., Google Syntactic N-Grams. Hence, such abundant instance VPs and large-scale probabilistic taxonomies enable the induction method to harvest rich conceptual CSK.
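The induction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `IS_A` mapping and the `conceptualize` helper are hypothetical stand-ins for a real probabilistic taxonomy lookup.

```python
from collections import namedtuple

Triplet = namedtuple("Triplet", ["subject", "verb", "object"])

# Input: instance-level verb phrases for the verb "eat".
instance_vps = [
    Triplet("John", "eat", "bread"),
    Triplet("Helen", "eat", "apple"),
    Triplet("Michael", "eat", "egg"),
]

# A toy isA lookup standing in for a large-scale probabilistic taxonomy.
IS_A = {
    "John": "person", "Helen": "person", "Michael": "person",
    "bread": "food", "apple": "food", "egg": "food",
}

def conceptualize(vps):
    """Lift instance-level VPs to concept-level assertions via isA relations."""
    return sorted({(IS_A[t.subject], t.verb, IS_A[t.object]) for t in vps})

print(conceptualize(instance_vps))  # [('person', 'eat', 'food')]
```

The real task is harder than this sketch suggests, because each instance maps to many candidate concepts with different granularities, which motivates the challenges listed below.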
However, it is not easy to mine high-quality verb-oriented CSK, and several challenges need to be solved:
- How to determine which VPs can be induced as verb-oriented CSK. In fact, we need to carefully select VPs that can be easily conceptualized as CSK, since some VPs are already abstract enough. For example, the VP “man eats fruit” is abstract enough to be viewed as verb-oriented CSK itself.
- How to select a concept with appropriate granularity for an instance. In general, an instance often has thousands of concepts in a large-scale probabilistic taxonomy. Some concepts are overly specific, and others are overly abstract. It is difficult to find a good trade-off between the specificity and abstractness of a concept for a given instance. For example, given the instance bread, it could be conceptualized as staple food, food, or object. However, only food is appropriate, since staple food is too specific and object is too abstract.
- How to measure the semantic plausibility of the induced verb-oriented CSK. For example, although name is an appropriate concept for the instance John, the candidate “name eats food” is implausible.
To solve the above challenges, we design two modules. The first module is an entropy-based metric used to solve the first challenge: it measures the abstractness of the subject and object in each VP. The second module is a verb-oriented CSK generator used to solve the second and third challenges. The generator is realized as a joint optimization model based on the minimum description length (MDL) principle and a neural language model (NLM): MDL is used to select an appropriate concept for an instance, and the NLM is employed to evaluate the plausibility of candidate VPs at the concept level. Furthermore, we introduce two strategies to accelerate the computation of the objective function: a simulated annealing-based approximate solution and a verb phrase clustering method.
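The interplay between the two criteria of the joint model can be sketched with a toy scoring function. This is a hedged illustration of the idea only: the combination weight `alpha`, the cost definitions, and all numbers are invented for exposition and are not the paper's Eq. (16).

```python
import math

def description_length(concept_prob_per_instance):
    """MDL-style code length (bits) of encoding a bag of instances given a
    chosen concept: a concept that covers the bag well yields short codes."""
    return -sum(math.log2(max(p, 1e-9)) for p in concept_prob_per_instance)

def joint_score(concept_prob_per_instance, nlm_plausibility, alpha=0.5):
    """Lower is better: MDL coverage cost plus an NLM implausibility cost."""
    mdl_cost = description_length(concept_prob_per_instance)
    plaus_cost = -math.log2(max(nlm_plausibility, 1e-9))
    return alpha * mdl_cost + (1 - alpha) * plaus_cost

# "food" covers {bread, apple, egg} well, and "person eats food" is
# plausible under a language model -> low joint cost.
good = joint_score([0.8, 0.7, 0.9], nlm_plausibility=0.9)

# "name" may fit "John" taxonomically, but "name eats food" is
# implausible -> the NLM term drives the joint cost up.
bad = joint_score([0.6], nlm_plausibility=0.01)
assert good < bad
```

The point of the sketch is that neither criterion alone suffices: MDL handles granularity, while the NLM vetoes taxonomically valid but semantically implausible candidates such as “name eats food”.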
Contributions. The contributions of this paper are summarized as follows:
- •
To the best of our knowledge, we are the first to acquire implicit verb-oriented CSK automatically. The most significant characteristics of the target CSK are at the concept level and rarely explicitly stated in corpora.
- •
We propose a joint optimization model based on the MDL principle and NLM to generate high-quality verb-oriented CSK. Besides, we also propose an entropy-based metric to identify noisy input VPs.
- •
We conduct extensive experiments on real-world datasets, and the results prove the effectiveness of our approach. Finally, we harvest 259 verbs and 18,406 verb-oriented CSK to form a commonsense KB called VoCSK. To verify the usefulness of this KB, we utilize the knowledge in VoCSK to improve the model performance on two real-world tasks, including context-aware conceptualization and commonsense question answering.
The rest of this paper is organized as follows. Section 2 discusses the related work of this paper. Section 3 gives the background of probabilistic taxonomies. Section 4 formulates the problem and briefly introduces our solution for verb-oriented CSK acquisition. Section 5 describes an entropy-based metric to identify noisy input verb triplets. Section 6 details the generation of verb-oriented CSK with a joint optimization model. Section 7 gives two strategies to speed up the computation of the objective function. The experiments are reported in Section 8, and our conclusion and future work are given in Section 9.
This paper is extended from our previous work [31] to provide a more comprehensive analysis. First, we add the related work (Section 2) and preliminary (Section 3). Second, we add more details on using an NLM to measure the plausibility of candidate commonsense triplets (Section 6.2) and introduce two strategies to accelerate the computation of the objective function (Sections 7.1 and 7.2). Third, we also analyze the complexity and feasibility of our method (Section 7.3). Fourth, to verify that most of the knowledge in VoCSK rarely appears in existing KBs and corpora, we add two matching experiments between 1) VoCSK and existing KBs as well as 2) VoCSK and corpora (Section 8.2). Fifth, we add extensive comparison experiments to illustrate the rationality of the hyper-parameter settings in this paper (Section 8.3). Sixth, we add the comparison experiments of the entropy-based triplet filter, verb-oriented CSK generator, and optimization algorithms to evaluate the effectiveness of our methods (Section 8.4). Last, and most important, we add two downstream applications to assess the usefulness of our mined VoCSK (Sections 8.5 and 8.6).
Related work
Related work in this paper can be divided into three groups: commonsense acquisition, conceptualization, and other topics.
Commonsense Acquisition. CSK acquisition has attracted a great deal of research interest. Existing methods can be divided into two categories. First, knowledge engineers and volunteers were asked to collect CSK manually. For example, in Cyc [25], commonsense facts were crafted by human experts using the CycL representation language. A lexical commonsense KB called WordNet [34] was also constructed manually.
Preliminary
In this section, we describe the background of a large-scale probabilistic taxonomy. Based on it, two definitions are further given. Some typical notations used in this paper are shown in Table 2.
We use a large-scale probabilistic taxonomy (e.g., Probase [54]) to provide massive fine-grained concepts for subjects and objects in VPs. The taxonomy is a large semantic network that consists of isA relations between terms. For example, google isA company, where google is the hyponym of company.
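The probabilistic nature of such a taxonomy can be sketched as follows. In a Probase-style taxonomy, each isA edge carries a co-occurrence count from which a conditional probability such as P(concept | instance) can be estimated; the counts below are invented for illustration.

```python
# Toy stand-in for a probabilistic taxonomy: (instance, concept) isA
# edges with co-occurrence counts. All counts are illustrative.
ISA_COUNTS = {
    ("google", "company"): 900,
    ("google", "search engine"): 600,
    ("google", "brand"): 300,
}

def p_concept_given_instance(instance):
    """Estimate P(concept | instance) from isA edge counts."""
    total = sum(c for (i, _), c in ISA_COUNTS.items() if i == instance)
    return {concept: c / total
            for (i, concept), c in ISA_COUNTS.items() if i == instance}

dist = p_concept_given_instance("google")
assert abs(sum(dist.values()) - 1.0) < 1e-9
assert max(dist, key=dist.get) == "company"
```

These conditional probabilities are what make taxonomy-guided conceptualization possible: they rank the thousands of candidate concepts of an instance by typicality.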
Overview
In this section, we first formalize the problem and then outline our solution for verb-oriented CSK generation. The CSK generation of different verbs is independent of each other. Hence, we analyze each verb and its phrases separately. In the following paragraphs, we discuss our solution for a given verb.
Entropy-based triplet filter
In this section, we detail the entropy-based triplet filter with the help of a probabilistic taxonomy. As mentioned above, we identify noisy verb triplets by measuring the abstractness of their subjects and objects. According to our observation, specific terms (subjects or objects) tend to be positioned at the lower levels of a probabilistic taxonomy, while abstract terms are usually located at the higher levels, as shown in Fig. 2. In this paper, the level of the leaf nodes (i.e., the most specific terms) serves as the starting point for measuring abstractness.
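The intuition behind an entropy-based abstractness measure can be sketched as below. This is a hedged illustration, not the paper's exact metric: it treats a term with many, evenly distributed hyponyms as abstract and a leaf term as maximally specific, with invented counts.

```python
import math

# Toy hyponym counts from a taxonomy; a leaf instance has no hyponyms.
# All numbers are illustrative.
HYPONYM_COUNTS = {
    "fruit": {"apple": 50, "banana": 40, "plantain": 30, "pear": 35},
    "plantain": {},  # leaf node: maximally specific
}

def abstractness(term):
    """Entropy of the term's hyponym distribution: higher entropy means
    the term is more abstract; leaf terms score 0."""
    counts = HYPONYM_COUNTS.get(term, {})
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log2(p) for p in probs)

assert abstractness("fruit") > abstractness("plantain")
```

A triplet whose subject and object both already score high on such a measure is itself close to concept level and can be filtered out of the induction input.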
Verb-oriented CSK generator
In this section, we elaborate on our generator for acquiring verb-oriented CSK. First, the MDL principle is used to select appropriate concepts for a bag-of-subjects (objects). An NLM is then employed to measure the plausibility of candidate commonsense triplets. Finally, based on the MDL principle and NLM, a joint model is proposed to generate verb-oriented CSK for a given set of the remaining verb triplets after filtering.
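The MDL-based concept selection step can be sketched with a two-part code length: a model cost for the concept plus a data cost for encoding the instances given the concept. This is an illustration of the principle under invented probabilities, not the paper's actual code lengths; overly specific concepts pay heavily for uncovered instances, while overly abstract ones pay through diffuse coverage.

```python
import math

# Invented P(instance | concept) values for three candidate concepts of
# the bag-of-objects {bread, apple, egg}.
P = {
    "staple food": {"bread": 0.5},                       # too specific
    "food":        {"bread": 0.2, "apple": 0.2, "egg": 0.2},
    "object":      {"bread": 0.001, "apple": 0.001, "egg": 0.001},
}

def mdl_cost(concept, instances, concept_prior=1e-4):
    """Two-part MDL: bits to encode the concept plus bits to encode each
    instance given the concept (uncovered instances are very costly)."""
    model_bits = -math.log2(concept_prior)
    data_bits = sum(-math.log2(P[concept].get(i, 1e-9)) for i in instances)
    return model_bits + data_bits

bag = ["bread", "apple", "egg"]
best = min(P, key=lambda c: mdl_cost(c, bag))
assert best == "food"
```

Under these toy numbers, "staple food" fails to cover apple and egg, "object" covers everything but too weakly, and "food" minimizes the total description length, matching the granularity intuition from the introduction.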
Optimization algorithms
Unfortunately, exhaustively enumerating concepts when optimizing objective Eq. (16) is costly since an instance in a probabilistic taxonomy often has thousands of concepts. Besides, unrelated triplets are difficult to conceptualize as appropriate verb-oriented CSK. For example, given two semantically unrelated triplets, it is hard to acquire appropriate concepts for their subjects and objects, respectively. To solve these problems, we propose two strategies to speed up the computation of the objective function.
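The simulated annealing strategy can be sketched generically as follows. This is a standard annealing loop over a toy cost table, not the authors' algorithm: the candidate set, costs, temperature schedule, and parameters are all illustrative stand-ins for the concept search over Eq. (16).

```python
import math
import random

CANDIDATES = ["staple food", "food", "object"]
COST = {"staple food": 74.0, "food": 20.0, "object": 43.0}  # invented costs

def anneal(cost, candidates, t0=10.0, cooling=0.95, steps=200, seed=0):
    """Minimize cost by random perturbation: worse moves are accepted with
    probability exp(-delta/t), which shrinks as the temperature cools."""
    rng = random.Random(seed)
    current = best = rng.choice(candidates)
    t = t0
    for _ in range(steps):
        proposal = rng.choice(candidates)
        delta = cost[proposal] - cost[current]
        if delta < 0 or rng.random() < math.exp(-delta / max(t, 1e-9)):
            current = proposal
        if cost[current] < cost[best]:
            best = current
        t *= cooling
    return best

assert anneal(COST, CANDIDATES) == "food"
```

Accepting occasionally worse assignments early lets the search escape locally optimal concept choices, while the cooling schedule makes it converge without enumerating every candidate concept.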
Experiments
In this section, we first report the statistics of the constructed VoCSK and evaluate whether the knowledge in VoCSK rarely appears in corpora and existing commonsense KBs. Then, we conduct extensive experiments to analyze the hyper-parameters in our methods and evaluate the effectiveness of the methods. We finally use the mined verb-oriented CSK to enhance the model performance on two downstream applications.
Conclusion and discussion
In this paper, we focus on the automatic acquisition of a typical kind of implicit verb-oriented CSK that rarely appears in corpora. To this end, we propose a taxonomy-guided induction approach to mine CSK from verb phrases with the help of a probabilistic taxonomy. Specifically, we design two modules to achieve this purpose. The first is an entropy-based metric to identify noisy input phrases. The second is a joint model based on the MDL principle and an NLM to generate verb-oriented CSK.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (57)
- Modeling by shortest data description. Automatica (1978)
- Efficient string matching: an aid to bibliographic search. Commun. ACM (1975)
- The Berkeley FrameNet project
- The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory (1998)
- A neural probabilistic language model. J. Mach. Learn. Res. (2003)
- Global learning of typed entailment rules
- Freebase: a collaboratively created graph database for structuring human knowledge
- Contextual text understanding in distributional semantic space
- Elements of Information Theory (2012)
- A Pólya urn document language model for improved information retrieval. ACM Trans. Inf. Syst. (2015)
- Ultra-fine entity typing with weak supervision from a masked language model, 1790–1799
- Investigations into the Role of Lexical Semantics in Word Sense Disambiguation
- BERT: pre-training of deep bidirectional transformers for language understanding
- A density-based algorithm for discovering clusters in large spatial databases with noise
- YAGO: a core of semantic knowledge unifying WordNet and Wikipedia
- Measuring nominal scale agreement among many raters. Psychol. Bull.
- Automatic labeling of semantic roles. Comput. Linguist.
- Taxonomy induction using hypernym subsequences
- Achieving human parity on automatic Chinese to English news translation
- Short text understanding through lexical-semantic analysis
- AutoName: a corpus-based set naming framework
- Optimization by simulated annealing. Science
- PageRank without hyperlinks: structural reranking using links induced by language models. ACM Trans. Inf. Syst.
- ALBERT: a lite BERT for self-supervised learning of language representations
- DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web
- Cyc: a large-scale investment in knowledge infrastructure. Commun. ACM
- The Winograd schema challenge
- English Verb Classes and Alternations: A Preliminary Investigation
☆ This work was supported by the National Key Research and Development Project (No. 2020AAA0109302), the Shanghai Science and Technology Innovation Action Plan (No. 19511120400), and the Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0103).