Semantic flow in language networks discriminates texts by genre and publication date

https://doi.org/10.1016/j.physa.2020.124895

Abstract

We propose a framework to characterize documents based on their semantic flow. The proposed framework encompasses a network-based model that connects sentences based on their semantic similarity. Semantic fields are detected using standard community detection methods. As the story unfolds, transitions between semantic fields are represented in Markov networks, which in turn are characterized via network motifs (subgraphs). Here we show that different book characteristics (such as genre and publication date) are discriminated by the adopted semantic flow representation. Remarkably, even without a systematic optimization of parameters, philosophy and investigative books were discriminated with an accuracy rate of 92.5%. While the objective of this study is not to create a text classification method, we believe that semantic flow features could be combined with traditional network-based models of texts, which capture only syntactical/stylistic information, to improve the characterization of texts.

Introduction

In the last few years, several interesting findings have been reported by studies using network science to model language [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. Network-based models have been used, for example, to address the authorship recognition problem, where the structure of the networks can provide valuable language-independent features. Other relevant applications relying on network science include the word sense disambiguation task [11], [12], the analysis of text veracity and complexity [13], [14], and scientometric studies [15].

Whilst most of the network-based language research has been carried out at the word level [16], [17], only a limited number of studies have been performed based on mesoscopic structures (sentences or paragraphs) [18]. In addition, most of the studies have analyzed language networks in a static way [19], [20]. In other words, once the networks are obtained, the order in which nodes (words, sentences, paragraphs) appear is disregarded. Here we probe the efficiency of sentence-based language networks in particular classification problems. Most importantly, unlike previous works hinging on network structure characterization [16], [17], we investigate whether the semantic flow along the narrative is an important feature for textual characterization in the considered classification tasks.

During the construction of a textual narrative, authors often follow a structured flow of ideas (introduction, narrative unfolding and conclusion). Even in books displaying a non-linear, complex narrative unfolding, one expects that an underlying linear semantic flow exists in the author's mind. In other words, even though narrative events might not organize themselves in a trivial linear form, the linearity imposed by written texts requires some type of linearization (e.g. by performing a walk through the network). This idea is illustrated in Fig. 1.
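As a rough illustration of the linearization idea, the sketch below turns a small network of ideas into a linear sequence by performing an unbiased random walk. It is only a minimal example under stated assumptions: the toy network, the function name `linearize`, and the choice of a uniform random walk are all hypothetical and are not taken from the paper, which mentions a walk only as one possible linearization strategy.

```python
import random


def linearize(adjacency, start, steps, seed=0):
    """Return the node sequence visited by an unbiased random walk.

    adjacency: dict mapping each node to a list of its neighbours.
    The walk stops early if it reaches a node without neighbours.
    """
    rng = random.Random(seed)  # fixed seed so the walk is reproducible
    walk = [start]
    for _ in range(steps):
        neighbours = adjacency[walk[-1]]
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk


# Hypothetical toy network of ideas (undirected, stored as adjacency lists).
ideas = {
    "crime": ["suspect", "clue"],
    "suspect": ["crime", "alibi"],
    "clue": ["crime", "alibi"],
    "alibi": ["suspect", "clue"],
}

sequence = linearize(ideas, "crime", steps=10)
print(" -> ".join(sequence))
```

Each printed sequence is one possible linear narrative over the same underlying network; a different seed or a biased walk would give a different ordering of the same ideas.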

The ideas conveyed by a text can be represented as a complex network, where nodes represent semantic blocks (e.g. sentences, paragraphs), and edges are established according to semantic similarities. To map such a conceptual network into a text, authors perform a linearization process, where nodes (concepts, ideas) are chosen one after another and then transformed into a linear narrative (see Fig. 1). Such a projection of a multidimensional space of ideas into a linear representation has been the object of studies in both network theory and language research. A consequence of such a linearization in texts is the presence of long-range correlations at several linguistic levels, a property that has been extensively explored over the last years [22], [23], [24], [25].
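The network representation described above can be sketched in a few lines. The example below links sentences whose similarity exceeds a threshold. It is a simplified stand-in, not the authors' method: similarity is approximated here by the cosine between bag-of-words vectors rather than by a genuinely semantic measure, and the stopword list, the threshold value, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical minimal stopword list, used only for this toy example.
STOPWORDS = {"the", "a", "to", "in", "was", "of"}


def bag_of_words(sentence):
    """Count the non-stopword tokens of a sentence."""
    return Counter(w for w in sentence.lower().split() if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sentence_network(sentences, threshold=0.3):
    """Return {(i, j): similarity} for sentence pairs at or above the threshold."""
    vectors = [bag_of_words(s) for s in sentences]
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        similarity = cosine(vectors[i], vectors[j])
        if similarity >= threshold:
            edges[(i, j)] = similarity
    return edges


sentences = [
    "the detective found a clue",
    "the clue led the detective to the suspect",
    "the children played in the garden",
    "the garden was full of children",
]
edges = sentence_network(sentences)
print(sorted(edges))
```

In this toy input the two detective sentences end up linked to each other, as do the two garden sentences, so the network already shows two candidate groups that a community detection method could recover.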

While complex semantic networks have been used in previous works to represent the relationship between ideas and concepts, only minor interest has been devoted to the analysis of how authors navigate the high-dimensional semantic relationships to generate a linear stream of words, sentences or paragraphs. In [26], a mesoscopic representation of networks was proposed. The authors used a set of consecutive paragraphs as a semantic, meaningful block. The semantic blocks were connected according to a lexical similarity index. The model aimed at combining a networked representation with the idea of a semantic sequence obtained when reading a document. Even though some interesting patterns were found, the concept of semantic fields was not clear, as no semantic community structure arises from mesoscopic networks. The problem of linearization of a network structure was studied in [21], where the efficiency of several random walks in different topologies was systematically analyzed. The efficiency was probed in a twofold manner: (i) the efficiency in transmitting the projected network; and (ii) the efficiency in recovering the original network. In [27], the authors explored the efficiency of navigating an idea space by varying network topologies and exploration strategies.

In the current paper, we take the view that authors write documents by applying a linearization process to the original network of ideas, following the procedure illustrated in Fig. 1. Upon analyzing the flow of ideas with the adopted network-based framework, we show that features extracted from the networks can be employed to characterize and classify texts. More specifically, we define the network of ideas as a network of sentences linked by semantic similarity. Semantic fields of similar sentences (nodes) are identified via network community detection. These fields (network communities) are then used to characterize the dynamics of authors' choices in moving from field to field as the story unfolds. Using a stochastic Markov model to represent the dynamics of choices of semantic fields performed by the author along the text, we show, as a proof of principle, that the adopted representation can retrieve textual features including style (publication epoch) and complexity.
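To make the Markov description of field-to-field transitions more concrete, the sketch below estimates first-order transition probabilities from a sequence of semantic field labels, one label per sentence in reading order. The label sequence, the function name `transition_probabilities`, and the decision to count self-transitions are illustrative assumptions; this excerpt does not specify how the authors estimate their model.

```python
from collections import Counter, defaultdict


def transition_probabilities(labels):
    """Estimate first-order Markov transition probabilities.

    labels: semantic field (community) of each sentence, in text order.
    Returns {source: {target: probability}}, where each row sums to 1.
    """
    counts = defaultdict(Counter)
    for source, target in zip(labels, labels[1:]):
        counts[source][target] += 1  # self-transitions are counted too (assumption)

    probabilities = {}
    for source, row in counts.items():
        total = sum(row.values())
        probabilities[source] = {t: n / total for t, n in row.items()}
    return probabilities


# Hypothetical field labels for eight consecutive sentences.
fields = ["A", "A", "B", "A", "C", "C", "B", "A"]
P = transition_probabilities(fields)

for source in sorted(P):
    print(source, P[source])
```

The resulting dictionary can be read as a weighted directed network whose nodes are semantic fields, which is the kind of structure on which subgraph (motif) patterns can then be counted.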

Section snippets

Research questions

The main objective is to answer the following research questions: are there any patterns of semantic flow in stories? Are these patterns related to textual characteristics? To address these questions, we use sentence networks to represent the semantic flow of ideas in texts. Such networks are summarized using a high-level representation based on the relationship between communities extracted from the sentence networks. Using this representation, we show that motifs extracted from such a

Materials and methods

This study can be divided into two parts. In the first part, we identify the semantic clusters (fields) of the story. Unlike the analysis of short texts, where semantic groups can be identified mostly by identifying paragraphs, in long texts, which are the focus of this study, the identification of semantic clusters is more challenging because semantic topics might not be organized in consecutive sentences/paragraphs owing to the linearization process illustrated in Fig. 1. In other words, the

Results and discussion

Here we probed whether the dynamics of changes in semantic groups in books can be used to characterize stories. The proposed methodology was applied in two distinct classification tasks. In the first task, we aimed at distinguishing three different thematic classes: (i) children's books; (ii) investigative books; and (iii) philosophy books. The second task aimed at discriminating books according to their publication dates. All books (and their respective classes) were obtained from the Gutenberg repository.

Conclusion

In this paper we investigated whether patterns of semantic flow arise for different classes of texts. To represent the relationship between ideas in texts, we used a sentence network representation, where sentences (nodes) are connected based on their semantic similarity. Semantic clusters were identified via community detection and a high-level representation of each book was created based on the transitions between communities as the story unfolds. Finally, motifs were extracted to characterize

CRediT authorship contribution statement

Edilson A. Corrêa Jr.: Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft, Visualization. Vanessa Q. Marinho: Software, Validation, Investigation. Diego R. Amancio: Conceptualization, Methodology, Validation, Formal analysis, Writing - original draft, Writing - review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

E.A.C. Jr. and D.R.A. acknowledge financial support from Google (Google Research Awards in Latin America grant). V.Q.M. acknowledges financial support from São Paulo Research Foundation (FAPESP) (Grant no. 15/05676-8). D.R.A. also thanks FAPESP (Grant no. 16/19069-9) and CNPq-Brazil (Grant no. 304026/2018-2) for support.

References (75)

  • Kello C.T. et al., Scaling laws in cognitive sciences, Trends Cogn. Sci. (2010)
  • Jin W. et al., Graph-based text representation and knowledge discovery
  • Cancho R.F. et al., Patterns in syntactic dependency networks, Phys. Rev. E (2004)
  • Montemurro M.A. et al., Keywords and co-occurrence patterns in the Voynich manuscript: An information-theoretic analysis, PLoS One (2013)
  • Ban K. et al., Robust clustering of languages across Wikipedia growth, R. Soc. Open Sci. (2017)
  • Tadić B. et al., Algebraic topology of multi-brain connectivity networks reveals dissimilarity in functional patterns during spoken communications, PLoS One (2016)
  • Amancio D.R. et al., Concentric network symmetry grasps authors’ styles in word adjacency networks, Europhys. Lett. (2015)
  • Stella M. et al., Forma mentis networks map how nursing and engineering students enhance their mindsets about innovation and health during professional growth, PeerJ Comput. Sci. (2020)
  • Castro N. et al., The multiplex structure of the mental lexicon influences picture naming in people with aphasia, J. Complex Netw. (2019)
  • Stella M., Modelling early word acquisition through multiplex lexical networks and machine learning, Big Data Cogn. Comput. (2019)
  • Stella M. et al., Forma mentis networks quantify crucial differences in STEM perception between students and experts, PLoS One (2019)
  • Agirre E. et al., Personalizing PageRank for word sense disambiguation
  • Amancio D.R. et al., Unveiling the relationship between complex networks metrics and word senses, Europhys. Lett. (2012)
  • Amancio D.R. et al., Identification of literary movements using complex networks to represent texts, New J. Phys. (2012)
  • Amancio D.R. et al., Complex networks analysis of language complexity, Europhys. Lett. (2012)
  • Cancho R.F. et al., The small world of human language, Proc. R. Soc. Lond. Ser. B: Biol. Sci. (2001)
  • Liu H. et al., Language clustering with word co-occurrence networks based on parallel texts, Chin. Sci. Bull. (2013)
  • Arruda H.F. et al., Paragraph-based representation of texts: a complex networks approach, Inf. Process. Manage. (2019)
  • Arruda H.F. et al., Using complex networks for text classification: Discriminating informative and imaginative documents, Europhys. Lett. (2016)
  • Amancio D.R., Probing the topological properties of complex networks modeling short written texts, PLoS One (2015)
  • Schenkel A. et al., Long range correlation in human writings, Fractals (1993)
  • Alvarez-Lacalle E. et al., Hierarchical structures induce long-range dynamical correlations in written texts, Proc. Natl. Acad. Sci. (2006)
  • Amit M. et al., Language and codification dependence of long-range correlations in texts, Fractals (1994)
  • Arruda H.F. et al., Representation of texts as complex networks: a mesoscopic approach, J. Complex Netw. (2018)
  • Arruda H.F. et al., Knowledge acquisition: A complex networks approach, Inform. Sci. (2017)
  • Bengio Y. et al., A neural probabilistic language model, J. Mach. Learn. Res. (2003)
  • Collobert R. et al., A unified architecture for natural language processing: Deep neural networks with multitask learning
