FAR-ASS: Fact-aware reinforced abstractive sentence summarization

https://doi.org/10.1016/j.ipm.2020.102478

Highlights

  • For natural language generation tasks, fact fabrication is a serious problem.

  • An automatic fact extraction scheme that leverages open information extraction and dependency parsing tools to extract structured fact tuples.

  • A factual correctness score function that accounts for both factual accuracy and factual redundancy.

  • A framework that improves informativeness and factual correctness by jointly optimizing a mixed-objective learning function via reinforcement learning.

Abstract

Automatic summarization systems provide an effective solution to today's unprecedented growth of textual data. For real-world tasks, such as data mining and information retrieval, the factual correctness of generated summaries is critical. However, existing models usually focus on improving informativeness rather than optimizing factual correctness. In this work, we present a Fact-Aware Reinforced Abstractive Sentence Summarization framework, denoted FAR-ASS, to improve the factual correctness of neural abstractive summarization models. Specifically, we develop an automatic fact extraction scheme leveraging OpenIE (Open Information Extraction) and dependency parser tools to extract structured fact tuples. Then, to quantitatively evaluate factual correctness, we define a factual correctness score function that considers both factual accuracy and factual redundancy. We further propose to adopt reinforcement learning to improve readability and factual correctness by jointly optimizing a mixed-objective learning function. We use the English Gigaword and DUC 2004 datasets to evaluate our model. Experimental results show that, compared with competitive models, our model significantly improves the factual correctness and readability of generated summaries, and also reduces duplicates while improving informativeness.

Introduction

With the unprecedented growth of textual information on the Internet, efficiently mining useful knowledge from large amounts of redundant information is a great challenge (Dybala et al., 2017; Nowakowski et al., 2019; Rzepka, Takishita, & Araki, 2020), which has necessitated the development of highly efficient automatic summarization systems (Barros et al., 2019; Gambhir & Gupta, 2017; Mohamed & Oussalah, 2019). The essential purpose of a summarization system is to generate a concise, readable and factual summary of the input text while keeping its gist (Dong et al., 2018; Jadhav & Rajan, 2018; Li et al., 2018). At present, there are two main types of summarization systems: extractive (Dong et al., 2018; Jadhav & Rajan, 2018; Zhang et al., 2018) and abstractive (Chen et al., 2016; Deng et al., 2020; Takase et al., 2016; Zheng et al., 2020). Extractive systems directly copy a few significant keywords from the source text to form a summary, which is in effect a simple compression of the source sentences. Abstractive systems can automatically generate new words and linguistic phrases that are not present in the input sentences. Compared with extractive methods, abstractive summarization is considered much closer to the way humans write summaries, but it also poses more challenges, such as poor readability and factual discrepancy (Li et al., 2020; Zhang et al., 2020).

In this paper, we focus on the task of abstractive sentence summarization, which generates a shorter sentence while maintaining the original meaning of the input sentences. Unlike document-level summarization, the original text in the sentence-level task is short, so a summary cannot be formed by directly extracting existing sentences. Recently, neural network models based on the encoder-decoder architecture have demonstrated powerful capabilities on the sentence summarization task and can generate summaries with very high ROUGE scores (Cao et al., 2017). However, sentence summarization inevitably needs to tailor, modify, reorganize and fuse the input sentences, so the generated sentences often fail to match the original relations, resulting in factual errors. Several researchers have studied the factual consistency of summaries (Falke, Ribeiro, Utama, Dagan, & Gurevych, 2019; Goodrich et al., 2019; Kryściński, McCann, Xiong, & Socher, 2019) and found that nearly 30% of summaries generated by abstractive models contain fake facts.

In fact, for downstream tasks such as data mining and information retrieval, generated abstractive summaries with excessive factual errors are almost useless in practice. However, previous researchers (Mehta & Majumder, 2018; Paulus, Xiong, & Socher, 2018) have focused on optimizing models to improve the informativeness of generated summaries, which yields high ROUGE scores even when some facts contradict the original text. As shown in Fig. 1, the seq2seq-baseline model (Nallapati et al., 2016) and the PG (pointer-generator) network (See, Liu, & Manning, 2017) produce the same fake fact, i.e., the subject of the verb “build” becomes “intel” instead of “vietnamese government”, which results in an entirely different fact from the original text. Consequently, although the summaries are highly informative (ROUGE-L = 0.49) and readable, they are useless because they contradict the original facts.
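For reference, the ROUGE-L score cited above is based on the longest common subsequence (LCS) between a candidate summary and a reference. The sketch below is a minimal illustration of the F-score variant; the beta weight favoring recall is a conventional choice, not a value taken from this paper:

```python
def lcs_len(a, b):
    # dynamic-programming longest common subsequence length over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference, beta=1.2):
    # ROUGE-L F-score: harmonic-style mix of LCS precision and recall,
    # with beta weighting recall more heavily (a common convention)
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```

A candidate identical to its reference scores 1.0; a candidate sharing no tokens scores 0.0.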

Intuitively, for NLG (natural language generation) tasks, fact fabrication is a serious problem that directly determines the usability of the generated text. Nevertheless, existing abstractive summarization models rarely pay attention to improving the factual correctness of generated summaries, and the few attempts so far have had limited success. For example, in 2017, Cao et al. (2017) used the OpenIE (Open Information Extraction) (Angeli, Premkumar, & Manning, 2015) system to extract the fact descriptions of the original input and then encoded them into the attention mechanism together with the original input. In 2019, Falke et al. (2019) used natural language inference systems to evaluate the factual consistency of generated summaries for the first time, and reranked the summaries on this basis. Also in 2019, Kryściński et al. (2019) proposed a weakly supervised model to evaluate the factual correctness of generated summaries.

In this work, our goal is to optimize the factual correctness of existing neural abstractive summarization models. In order to maintain the factual consistency between the generated text and the original input, we must first extract fact descriptions. To this end, we take advantage of popular tools OpenIE and dependency parser. OpenIE represents a fact as a relation triple consisting of (subject; predicate; object). But, for different sentences, complete relation triples are not always available. Therefore, we utilize the dependency parser to mine suitable relation tuples to further expand the facts. On this basis, we design a fact extraction scheme that can extract complete structured relation tuples from text to describe the facts. Then, we define a factual correctness score by comparing the relation tuples between the original text and generated summary. Furthermore, we also develop a mixed-objective learning function by linearly combining a factual correctness objective, a textual overlap objective, and a language model objective. Finally, we utilize the reinforcement learning (RL) strategy to jointly optimizing them.
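The fact descriptions above reduce to structured (subject; predicate; object) tuples. The sketch below illustrates only the downstream filtering, cleaning, and deduplication step over raw tuples; the normalization rules are illustrative assumptions, and a real pipeline would obtain the raw tuples from OpenIE and a dependency parser:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FactTuple:
    # frozen=True makes instances hashable, so they can live in sets
    subject: str
    predicate: str
    obj: str

def normalize(span):
    # lowercase and collapse whitespace so near-duplicate spans merge
    return " ".join(span.lower().split())

def clean_tuples(raw):
    # filter incomplete tuples, normalize spans, and deduplicate,
    # preserving first-seen order
    seen, out = set(), []
    for s, p, o in raw:
        if not (s and p):  # a usable fact needs at least a subject and predicate
            continue
        t = FactTuple(normalize(s), normalize(p), normalize(o or ""))
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out
```

For example, `("Intel", "build", "plant")` and `("intel ", "build", "plant")` collapse into a single tuple, while a tuple with an empty subject is discarded.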

Our contributions are as follows:

  • A fact extraction scheme. First, we utilize the popular OpenIE tool to mine complete relation triples. Then, we use a dependency parser to extract suitable relation tuples that further expand the facts. We generate a complete structured set of fact descriptions by filtering, cleaning, and deduplicating the extracted tuples.

  • An evaluation function. We design a scoring function to describe the factual correctness of generated summaries quantitatively. In this work, we consider the factual accuracy and factual redundancy of generated summaries and systematically quantify their factual correctness in the open domain.

  • A reinforcement learning framework. We propose a complete framework and a training strategy for abstractive sentence summarization models to improve the informativeness and factual correctness by jointly optimizing a mixed-objective learning function via RL.

  • Extensive experiments. We conduct extensive experiments on the English Gigaword, Google News, and DUC 2004 datasets, showing that our model markedly improves the factual correctness of generated summaries compared with competitive methods and also reduces duplicates while enhancing informativeness.
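To make the evaluation-function contribution concrete, the sketch below scores a summary's fact tuples against the source's. The particular way accuracy and redundancy are combined here is an assumption for illustration, not the paper's exact formula:

```python
def factual_score(source_tuples, summary_tuples):
    """Hypothetical factual-correctness score.

    accuracy: fraction of summary tuples supported by the source;
    redundancy: fraction of summary tuples that are duplicates.
    The product below is one simple way to penalize both, assumed
    for illustration only.
    """
    if not summary_tuples:
        return 0.0
    source = set(source_tuples)
    supported = sum(1 for t in summary_tuples if t in source)
    accuracy = supported / len(summary_tuples)
    redundancy = 1 - len(set(summary_tuples)) / len(summary_tuples)
    return accuracy * (1 - redundancy)
```

A summary whose tuples are all supported and all distinct scores 1.0; repeating a correct fact, or stating an unsupported one, lowers the score.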

Section snippets

Neural abstractive summarization models

The seq2seq framework is one of the mainstream approaches to generating abstractive summaries. In 2015, Rush, Chopra, and Weston (2015) proposed a Convolutional Neural Network (CNN) encoder with a neural network language model under the seq2seq framework, which was the first application of the seq2seq model to the abstractive sentence summarization task. After that, Zhou et al. (2017) and Chopra, Auli, and Rush (2016) further improved the RNN-based summarization model. In 2016, Gu et al. (2016) added a

Background

In this section, we introduce our baseline pointer-generator network and fact extraction scheme. The pointer-generator network is an extension of the seq2seq-baseline model, which adds a copy mechanism to the original network structure by directly copying words from the original text into the proper positions in the generated summaries. We utilize the popular OpenIE and dependency parser tools to mine the fact descriptions in the input and generated summaries. We generate a complete set of
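The copy mechanism described above can be illustrated with a toy computation of the pointer-generator's final word distribution, where a generation probability p_gen interpolates between the decoder's vocabulary distribution and attention-weighted copying from the source (a simplified sketch of the idea in See et al., 2017, not their exact implementation):

```python
def final_distribution(p_gen, vocab_dist, attention, source_words):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention mass
    on the source positions that hold w. Inputs are plain dicts/lists
    standing in for the model's tensors."""
    out = {w: p_gen * p for w, p in vocab_dist.items()}
    for a, w in zip(attention, source_words):
        # copy probability accumulates over repeated source occurrences of w
        out[w] = out.get(w, 0.0) + (1 - p_gen) * a
    return out
```

Note that an out-of-vocabulary source word (e.g. a rare entity name) can still receive probability mass through the copy term, which is the point of the mechanism.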

Fact-aware reinforced neural summarization

As shown in Fig. 4, our model is mainly composed of three parts. The blue box represents the neural summarization model; in our experiments, we use the seq2seq-baseline and PG models from Section 3.1 as the summarization models to demonstrate the effectiveness of our approach. The green part is the fact extractor, which applies our fact extraction scheme from Section 3.2 to extract fact tuples from the input text and generated summaries. The yellow part is policy learning. In order to
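The policy-learning component can be sketched as a self-critical policy-gradient loss mixed with a maximum-likelihood term. The weights, the reward composition, and the greedy-decode baseline below are illustrative assumptions rather than the paper's exact configuration:

```python
def mixed_loss(log_probs, sampled_reward, baseline_reward,
               ml_loss, gamma_rl=0.9, gamma_ml=0.1):
    """Hypothetical mixed-objective training loss.

    log_probs: per-token log-probabilities of the sampled summary;
    sampled_reward / baseline_reward: scalar rewards (e.g. a mix of
    ROUGE and a factual-correctness score) for the sampled and the
    greedy-decoded summary; gamma_* are assumed mixing weights.
    """
    advantage = sampled_reward - baseline_reward  # self-critical baseline
    # REINFORCE term: push up log-probs of samples that beat the baseline
    rl_loss = -advantage * sum(log_probs)
    return gamma_rl * rl_loss + gamma_ml * ml_loss
```

Keeping a maximum-likelihood term in the mix is a common way to preserve readability while the RL term optimizes the non-differentiable reward.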

Experiments

In this section, we introduce our experimental datasets, the main evaluation metrics, the implementation details, and the comparative methods.

Results

In this section, we show that our model performs significantly better than the competitive methods. We first present the results of the informativeness and factual correctness evaluations. Then, we perform a manual evaluation on 100 random samples to verify that our gains in ROUGE and factual F1 scores are accompanied by improvements in human-judged readability and quality.

Conclusion and future work

In this paper, we focus on the task of abstractive sentence summarization. We present a general framework and a hybrid learning strategy to improve the factual correctness of neural abstractive summarization models. We employ the popular OpenIE and dependency parser tools to extract structured fact tuples. In order to evaluate the factual correctness of generated summaries quantitatively, we define a factual correctness score function that considers the factual accuracy and factual redundancy.

CRediT authorship contribution statement

Mengli Zhang: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization. Gang Zhou: Conceptualization, Writing - review & editing, Supervision. Wanting Yu: Conceptualization, Writing - review & editing, Supervision. Wenfen Liu: Conceptualization, Writing - review & editing, Supervision.

Declaration of Competing Interest

All authors declare that they have no conflict of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (61862011), Guangxi Science and Technology Foundation (2018GXNSFAA138116, 2019GXNSFGA245004).

References (46)

  • Y.C. Chen et al., Fast abstractive summarization with reinforce-selected sentence rewriting
  • S. Chopra et al., Abstractive sentence summarization with attentive recurrent neural networks
  • J. Clarke et al., Global inference for sentence compression: An integer linear programming approach, Journal of Artificial Intelligence Research (2008)
  • T. Cohn et al., Sentence compression beyond word deletion
  • Z. Deng et al., A two-stage Chinese text summarization algorithm using keyword information and adversarial learning, Neurocomputing (2020)
  • Y. Dong et al., BanditSum: Extractive summarization as a contextual bandit
  • P. Dybala et al., Towards joking, humor sense equipped and emotion aware conversational systems, Advances in Affective and Pleasurable Design (2017)
  • T. Falke et al., Ranking generated summaries by correctness: An interesting but challenging application for natural language inference
  • K. Filippova et al., Sentence compression by deletion with LSTMs
  • K. Filippova et al., Overcoming the lack of parallel data in sentence compression
  • M. Gambhir et al., Recent automatic text summarization techniques: A survey, Artificial Intelligence Review (2017)
  • B. Goodrich et al., Assessing the factual accuracy of generated text
  • J. Gu et al., Incorporating copying mechanism in sequence-to-sequence learning