1 Introduction

Exchangeable relational data (Nowicki and Snijders, 2001; Ishiguro et al., 2010; Schmidt and Mørup, 2013), such as tensor data (Zhang, 2019; Pensky, 2019) and collaborative filtering data (Luo et al., 2016; Zhang et al., 2019; Li et al., 2009), are commonly observed in many real-world applications. In general, exchangeable relational data describe the relationship between two or more nodes (e.g. friendship linkages in social networks; user-item rating matrices in recommendation systems; and protein-to-protein interactions in computational biology), where exchangeability refers to the phenomenon that the joint distribution over all observed relations remains invariant under node permutations. Techniques for modelling exchangeable relational data include node partitioning to form “homogeneous blocks” (Nowicki and Snijders, 2001; Kemp et al., 2006; Roy and Teh, 2009), graph embedding methods to generate low-dimensional representations (Pang et al., 2013; Bouzas et al., 2015; Dutta and Sahbi, 2019), and optimization strategies to minimize prediction errors (Nikolaidis et al., 2012; Wang et al., 2019).

Graphon theory (Orbanz, 2009; Orbanz and Roy, 2014; Lloyd et al., 2012) has recently been proposed as a unified theoretical framework for modelling exchangeable relational data. In graphon theory, each relation from a node i to another node j is represented by an intensity value generated by a graphon function, which maps from the corresponding coordinates of the node pair in a unit square, \((u_{i}^{(1)}, u_{j}^{(2)})\), to an intensity value in a unit interval. Many existing Bayesian methods for modelling exchangeable relational data can be described using graphon theory with various graphon functions. Fig. 1 illustrates several typical graphon functions, including those of the Stochastic Block Model (SBM) (Nowicki and Snijders, 2001; Kemp et al., 2006), the Mondrian Process Relational Model (MP-RM) (Roy and Teh, 2009), the Rectangular Tiling Process Relational Model (RTP-RM) (Nakano et al., 2014), and the Gaussian Process Prior Relational Model (GP-RM) (Orbanz, 2009).

The simplest of these graphon functions is the regular-grid piecewise-constant graphon (Fig. 1, left). Generally, it is constructed from two independent partition processes in a two-dimensional space. The resulting orthogonal crossover between these dimensions produces a regular grid partition in the space. Typical regular-grid partition models include the SBM (Nowicki and Snijders, 2001) and its infinite-state variant, the Infinite Relational Model (IRM) (Kemp et al., 2006). The SBM uses a Dirichlet distribution (or Dirichlet process for the IRM) to independently generate a finite (or infinite for the IRM) number of segments in each dimension.

The Mondrian process relational model (MP-RM; Fig. 1, centre-left) (Roy et al., 2007; Roy and Teh, 2009; Roy, 2011) is a representative model that generates k-d tree-structured piecewise-constant graphons. In general, the Mondrian process recursively generates axis-aligned cuts in the unit square and partitions the space in a hierarchical fashion known as a k-d tree. The tree structure is regulated by attaching an exponentially distributed cost to each axis-aligned cut, so that the tree-generation process terminates when the accumulated cost exceeds a budget value. The Binary Space Partitioning-Tree process relational model (BSP-RM) (Fan et al., 2018, 2019) also generates tree-structured partitions. The difference between the BSP-RM and the MP-RM is that the BSP-RM uses two dimensions to form oblique cuts and thus generates convex polyhedron-shaped blocks. These oblique cuts concentrate more on describing the inter-dimensional dependency and can produce more efficient space partitions.

The Rectangular Tiling process relational model (RTP-RM; Fig. 1, centre-right) (Nakano et al., 2014) produces a flat partition structure on a two-dimensional array by assigning each entry to an existing block or a new block in sequence, without violating the rectangular restriction of the blocks. By relaxing the restrictions of the hierarchical or regular-grid structure, the RTP-RM aims to provide more flexibility in block generation. However, the process of generating blocks is quite complicated for practical use. As a result, while the hierarchical and regular-grid partition models can be used for continuous space and multi-dimensional arrays (after trivial modifications), the RTP-RM is restricted to (discrete) arrays only.

The Rectangular Bounding process relational model (RBP-RM) (Fan et al., 2018) uses a bounding strategy to generate rectangular blocks in the space. In contrast to the previously described cutting strategies, the RBP-RM concentrates more on the important regions of the space and avoids over-modelling sparse and noisy regions. In the RBP-RM, the number of possible intensities is equivalent to the number of blocks, which follows a Poisson distribution and is almost surely finite.

The Gaussian process relational model (GP-RM; Fig. 1, right) (Lloyd et al., 2012) uses a prior over a random function in the unit square to form a continuous graphon. In this way it can potentially generate the desired continuous intensity values via the graphon function. However, the computational cost of the GP-RM is the same as that of the Gaussian process, which scales cubically with the number of nodes n (i.e. \({\mathcal {O}}(n^3)\)).

Fig. 1

Visualisation of Bayesian graphon-construction methods for modelling exchangeable relational data. From left to right: the Stochastic Block Model (SBM); the Mondrian Process Relational Model (MP-RM); the Rectangular Tiling Process Relational Model (RTP-RM) and the Gaussian Process Prior Relational Model (GP-RM). For any pair of node coordinates \((u_{i}^{(1)}, u_{j}^{(2)})\) the relation intensity is mapped from the unit square to a unit interval using a graphon function (denoted \(g_1,\ldots ,g_4\)), where a darker colour observed in the unit square represents a higher mapped intensity in the unit interval

These existing models can be broadly classified into two categories. The first category, which includes the SBM, MP-RM, RTP-RM and RBP-RM models, uses node-partitioning strategies to construct the relational model. By partitioning the set of nodes into groups along node co-ordinate margins, blocks can be constructed from these marginal groups that partition the full-dimensional co-ordinate space according to a given construction method (Fig. 1). These models then assume that the relation intensity for node pairs is constant within each block. That is, the graphon function that generates intensity values over the node co-ordinate space is constructed in a piecewise-constant manner. However, such piecewise-constant graphons provide only limited modelling flexibility, with a fixed and constant number of intensity values (i.e. equivalent to the number of blocks). As a result, they are restricted in their ability to model the ground truth well. The second category of relational models, which includes the GP-RM, aims to address this limitation, as its graphon function can provide continuous intensity values. However, the computational complexity of estimating this graphon function is cubic in the number of nodes, which makes it practically non-viable for medium or large sized datasets.

In this paper, we propose to apply a smoothing procedure to piecewise-constant graphons to form smoothing graphons, which will naturally permit continuous intensity values for describing relations without impractically increasing computational costs. As the Stochastic Block Model is one of the most popular Bayesian methods for modelling exchangeable relational data, we focus on developing smoothing strategies within the piecewise-constant SBM graphon framework. In particular, we develop two variant smoothing strategies for the SBM: the Integrated Smoothing Graphon (ISG) and the Latent Feature Smoothing Graphon (LFSG).

  • ISG: In contrast to existing piecewise-constant graphons, which determine the intensity value based only on the block within which a node pair resides, the ISG alternatively calculates a mixture intensity for each pair of nodes by taking into account the intensities of all other blocks. The resulting mixture graphon function is constructed so that its output values are continuous.

  • LFSG: This strategy introduces auxiliary pairwise hidden labels to decompose the calculation of the mixture intensity used in the ISG, in order to enable efficient inference. In addition, the introduction of these labels allows each node to belong to multiple groups in each dimension (e.g. a user might interact with different people by playing different roles in a social network), which provides more modelling flexibility compared with the ISG (and existing piecewise-graphons) where each node is assigned to one group only.

Note that while we develop the ISG and LFSG for SBM-based graphons, our smoothing approach can be applied easily to other piecewise-constant graphons. The main contributions of our work are summarised as follows:

  • We identify the key limitation of existing piecewise-constant graphons and develop a smoothing strategy to flexibly generate continuous graphon intensity values, which might better reflect the reality of a process.

  • We develop the ISG smoothing strategy for the SBM to demonstrate how piecewise-constant graphons can be converted into smoothing graphons.

  • We improve on the ISG by devising the LFSG, which achieves the same objective of generating continuous intensity values but without sacrificing computational efficiency. Compared with the ISG, where each node belongs to only one group, the LFSG allows each node to belong to multiple groups (e.g. so that the node plays different roles in different relations), thereby also providing a probabilistic interpretation of node groups.

  • We evaluate the performance of our methods on the task of link prediction by comparing with the SBM and other benchmark methods. The experimental results clearly show that the smoothing graphons can achieve significant performance improvement over piecewise-constant graphons.

2 Preliminaries

2.1 Graphon theory

The Aldous–Hoover theorem (Hoover, 1979; Aldous, 1981) provides the theoretical foundation for modelling exchangeable multi-dimensional arrays (i.e. exchangeable relational data) conditioned on a stochastic partition model. A random 2-dimensional array is called separately exchangeable if its distribution is invariant under separate permutations of rows and columns.

Theorem 1

 (Orbanz and Roy, 2014; Lloyd et al., 2012): A random array \((R_{ij})\) is separately exchangeable if and only if it can be represented as follows: there exists a random measurable function \(F:[0,1]^3 \mapsto {\{0,1\}}\) such that \((R_{ij}) \overset{d}{=} \left( F(u_i^{(1)}, u_j^{(2)}, \nu _{ij})\right)\), where \(\{u_i^{(1)}\}_i, \{u_j^{(2)}\}_j\) and \(\{\nu _{ij}\}_{i,j}\) are two sequences and an array of i.i.d. uniform random variables in [0, 1], respectively.

Many existing Bayesian methods for modelling exchangeable relational data can be represented as in Theorem 1, using specific forms of the mapping function F. For instance, as illustrated in Fig. 1, given the uniformly distributed node coordinates \((u_i^{(1)}, u_j^{(2)})\), the SBM corresponds to F being a regular-grid constant graphon, in which the partitions along each dimension are crossed over to form the blocks; the MP-RM corresponds to F being a k-d tree-structured constant graphon, in which the blocks are hierarchically aligned; the RTP-RM takes F to be an arbitrary-rectangle constant graphon, in which the blocks are floor-plan aligned; and the GP-RM lets F be a continuous two-dimensional function. While taking different forms, these graphon functions commonly map from pairs of node coordinates in a unit square to intensity values in a unit interval. For the above piecewise-constant graphons, we can write the function F as \(F(u_i^{(1)},u_{j}^{(2)}|\{\omega _k, \Box _k\})=\sum _k\omega _k\cdot \pmb {1}((u_i^{(1)},u_{j}^{(2)})\in \Box _k)\), where \(\Box _k\) is the kth block and \(\omega _k\) refers to the intensity variable of \(\Box _k\). As shown in Fig. 1, a darker colour for the pair of node coordinates indicates a higher intensity in the interval, which corresponds to a larger probability of observing or generating the relationship between the pair of nodes.
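To make this piecewise-constant form concrete, the following sketch (in Python, with illustrative block intensities and boundaries of our own choosing, not values from the paper) evaluates such a graphon function at a pair of node coordinates:

```python
import numpy as np

# Illustrative block intensities (omega_k) on a 2x2 regular grid, with
# shared segment boundaries along each dimension of the unit square.
omega = np.array([[0.9, 0.1],
                  [0.2, 0.7]])          # omega[k1, k2] = intensity of box (k1, k2)
edges = np.array([0.0, 0.4, 1.0])       # segment boundaries in [0, 1]

def piecewise_constant_graphon(u1, u2):
    """F(u1, u2) = sum_k omega_k * 1((u1, u2) in box_k)."""
    k1 = min(np.searchsorted(edges, u1, side="right") - 1, len(edges) - 2)
    k2 = min(np.searchsorted(edges, u2, side="right") - 1, len(edges) - 2)
    return omega[k1, k2]

print(piecewise_constant_graphon(0.3, 0.8))  # -> 0.1, the intensity of box (0, 1)
```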

2.2 Piecewise-constant graphons and their limitations

Many alternative piecewise-constant graphons can be implemented to model exchangeable relational data R, where R is a binary adjacency matrix which can be either directed (asymmetric) or undirected (symmetric). Here we consider the more complicated situation where R is an \(n\times n\) asymmetric matrix with \(R_{ji}\ne R_{ij}\) (the extension of our method to the symmetric case is straightforward). For any two nodes in R, if node i is related to node j then \(R_{ij}=1\); otherwise \(R_{ij}=0\).

We take the SBM as an illustrative example. In a two-dimensional SBM, there are two distributions generating the groups, \(\pmb {\theta }^{(1)},\pmb {\theta }^{(2)}\sim \text {Dirichlet}(\pmb {\alpha }_{1\times K})\), where K is the number of groups and \(\pmb {\alpha }_{1\times K}\) is the K-vector concentration parameter. Each node \(i\in \{1, \ldots , n\}\) is associated with two hidden labels \(z_i^{(1)}, z_i^{(2)}\in \{1, \ldots , K\}\), and \(\{z_i^{(1)}\}_{i}\sim \text {Categorical}(\pmb {\theta }^{(1)}), \{z_i^{(2)}\}_{i}\sim \text {Categorical}(\pmb {\theta }^{(2)})\) for \(i=1,\ldots ,n\). Hence, \(z_i^{(1)}\) and \(z_i^{(2)}\) denote the particular groups that node i belongs to in two dimensions, respectively. (That is, \(z_i^{(1)}\) is the group of node i when i links to other nodes, and \(z_i^{(2)}\) is the group of node i when other nodes link to it.) The relation \(R_{ij}\) from node i to node j is then generated based on the interaction between their respective groups \(z_i^{(1)}\) and \(z_j^{(2)}\).

Let \(\pmb B\) be a \(K\times K\) matrix, where each entry \(B_{k_1, k_2}\in [0, 1]\) denotes the probability of generating a relation from group \(k_1\) in the first dimension to group \(k_2\) in the second dimension. For \(k_{1},k_{2}=1, \ldots , K, B_{k_{1},k_{2}}\sim \text {Beta}(\alpha _0, \beta _0)\), where \(\alpha _0, \beta _0\) are hyper-parameters for \(\{B_{k_1,k_2}\}_{k_1, k_2}\). That is, we have \(P(R_{ij}=1|z_i^{(1)}, z_j^{(2)}, \pmb B)=B_{z_i^{(1)}, z_j^{(2)}}\).

Now, consider the SBM from the graphon perspective (Fig. 1; left). Let \(\pmb {\theta }^{(1)},\pmb {\theta }^{(2)}\) be group (or segment) distributions of the two dimensions in a unit square respectively. The generation of hidden labels \(z_i^{(1)}\) for node i and \(z_j^{(2)}\) for node j proceeds as follows: Uniform random variables \(u_i^{(1)}\) and \(u_j^{(2)}\) are respectively generated in the first and second dimensions. Then, \(z_i^{(1)}\) and \(z_j^{(2)}\) can be determined by checking in which particular segments of \(\pmb {\theta }^{(1)}\) and \(\pmb {\theta }^{(2)}\), \(u_i^{(1)}\) and \(u_j^{(2)}\) are located respectively. Formally, we have:

$$\begin{aligned}&\pmb {\theta }^{(1)}, \pmb {\theta }^{(2)}\sim \text {Dirichlet}(\pmb {\alpha }_{1\times K}),\,\, \,\,u_i^{(1)}, u_j^{(2)}\sim \text {Unif}[0, 1] \nonumber \\&z_i^{(1)}=(\pmb {\theta }^{(1)})^{-1}(u_i^{(1)}),\,\, \,\, z_j^{(2)}=(\pmb {\theta }^{(2)})^{-1}(u_j^{(2)}), \end{aligned}$$
(1)

where \((\pmb {\theta }^{(1)})^{-1}(u_i^{(1)})\) and \((\pmb {\theta }^{(2)})^{-1}(u_j^{(2)})\) respectively map \(u_i^{(1)}\) and \(u_j^{(2)}\) to particular segments of \(\pmb {\theta }^{(1)}\) and \(\pmb {\theta }^{(2)}\).
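As a minimal illustration of this inverse mapping, the sketch below (zero-based labels; all names are ours) locates the segment containing a coordinate u via the accumulated boundaries \(L_0, \ldots, L_K\):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4
theta = rng.dirichlet(np.ones(K))              # segment distribution theta ~ Dirichlet(1_K)
L = np.concatenate(([0.0], np.cumsum(theta)))  # accumulated boundaries L_0, ..., L_K

def segment_of(u, L):
    """(theta)^{-1}(u): zero-based index of the segment of [0, 1] containing u."""
    k = int(np.searchsorted(L, u, side="right")) - 1
    return min(k, len(L) - 2)                  # clamp guards the u ~ 1 boundary case

u = rng.uniform()
z = segment_of(u, L)                           # hidden label z in {0, ..., K-1}
```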

A regular-grid partition (\(\boxplus\)) can be formed in the unit square by combining the segment distributions \(\pmb {\theta }^{(1)}, \pmb {\theta }^{(2)}\) in two dimensions. Each block in this regular-grid partition is rectangular. Let \(L_{k}^{(1)}=\sum _{k'=1}^k{{\theta }}_{k'}^{(1)}\) and \(L_{k}^{(2)}=\sum _{k'=1}^k{{\theta }}_{k'}^{(2)}\) be the accumulated sums of the first k elements of \(\pmb {\theta }^{(1)}\) and \(\pmb {\theta }^{(2)}\) respectively (w.l.o.g. \(L_{0}^{(1)}=L_{0}^{(2)}=0, L_{K}^{(1)}=L_{K}^{(2)}=1\)). Let \(\Box _{k_1,k_2}=[L^{(1)}_{k_1-1}, L^{(1)}_{k_1}]\times [L^{(2)}_{k_2-1}, L^{(2)}_{k_2}]\) denote the \((k_1, k_2)\)th block in the unit square of \([0,1]^2\), such that \(\bigcup _{k_1, k_2}\Box _{k_1,k_2}=[0,1]^2\). Then, an intensity function defined on the pair \((u_i, u_j)\) can be obtained by the piecewise-constant graphon function

$$\begin{aligned} g\left( u_i^{(1)}, u_j^{(2)}\right) = \sum _{k_1, k_2} \pmb {1}{((u_i^{(1)},u_j^{(2)})\in \Box _{k_1,k_2})} \cdot B_{k_1,k_2} \end{aligned}$$
(2)

where \(\pmb {1}(A)=1\) if A is true and 0 otherwise, and where \(B_{k_1, k_2}\in [0, 1]\) is the intensity of the \((k_1, k_2)\)th block. We term (2) the SBM-graphon. Thus, the generative process of the SBM-graphon can be described as:

  1. For \(k_{1},k_{2}=1, \ldots , K\), generate \(B_{k_{1},k_{2}}\sim \text {Beta}(\alpha _0, \beta _0)\), where \(\alpha _0, \beta _0\) are hyper-parameters;

  2. Generate the segment distributions \(\pmb {\theta }^{(1)}, \pmb {\theta }^{(2)}\) via Eq. (1) and form the partition (\(\boxplus\)) according to combinations of \(\pmb {\theta }^{(1)}, \pmb {\theta }^{(2)}\) in the unit square;

  3. Uniformly generate the \(1^{st}\) dimension coordinates \(\{u_i^{(1)}\}_{i=1}^n\) and the \(2^{nd}\) dimension coordinates \(\{u_i^{(2)}\}_{i=1}^n\) for all nodes;

  4. For \(i,j=1, \ldots , n\):

     (a) Calculate the intensity \(g(u_i^{(1)}, u_j^{(2)})\) according to Eq. (2) based on the node coordinates \((u_i^{(1)}, u_j^{(2)})\);

     (b) Generate \(R_{ij}\sim \text{ Bernoulli }(g(u_i^{(1)}, u_j^{(2)}))\).

Alternatively, if considering the latent labels (\(z_i^{(1)}, z_j^{(2)}\)) for nodes i and j, then from Eq. (1) and the deterministic relations between \(z_i^{(1)}, z_j^{(2)}\) and \(\pmb {\theta }^{(1)}, u_i^{(1)}, \pmb {\theta }^{(2)}, u_j^{(2)}\), the intensity value \(g(u_i^{(1)}, u_j^{(2)})\) in step 4(a) is the same as \(B_{z_i^{(1)}, z_j^{(2)}}\). As a result, step 4 in the above generative process can be equivalently written as

  4. For \(i, j=1, \ldots , n\):

     (a) Generate the latent labels (\(z_i^{(1)}, z_j^{(2)}\)) via Eq. (1);

     (b) Generate \(R_{ij}\sim \text {Bernoulli}\left( B_{z_i^{(1)}, z_j^{(2)}}\right)\).
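A compact end-to-end sketch of this generative process (latent-label form; zero-based labels and illustrative sizes and hyper-parameters, not the paper's settings) might read:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 50, 4
alpha0, beta0 = 1.0, 1.0                        # illustrative Beta hyper-parameters

# Steps 1-3: block intensities, segment distributions, node coordinates.
B = rng.beta(alpha0, beta0, size=(K, K))
L1 = np.cumsum(rng.dirichlet(np.ones(K)))       # boundaries L_1, ..., L_K (dim 1)
L2 = np.cumsum(rng.dirichlet(np.ones(K)))       # boundaries L_1, ..., L_K (dim 2)
u1, u2 = rng.uniform(size=n), rng.uniform(size=n)

# Step 4 (latent-label form): z_i = (theta)^{-1}(u_i), then R_ij ~ Bernoulli(B[z_i, z_j]).
z1 = np.minimum(np.searchsorted(L1, u1, side="right"), K - 1)
z2 = np.minimum(np.searchsorted(L2, u2, side="right"), K - 1)
R = rng.binomial(1, B[z1[:, None], z2[None, :]])
```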

The SBM-graphon has several limitations. To begin with, the SBM-graphon function (Eq. (2)) is piecewise-constant. That is, the generated intensities for node pairs are discrete, and the number of distinct intensity values is limited to the number of blocks in the partition (\(\boxplus\)). Consequently, this leads to an over-simplified description when modelling real relational data, which can result in at least two issues. On the one hand, as long as two nodes belong to the same segment in one dimension, their probabilities of generating relations with another node are the same, even if the distance between the two nodes in that dimension is quite large. On the other hand, given two nodes that are close in one dimension but belong to two adjacent segments, their probabilities of generating relations with another node can be dramatically different, depending on the respective block intensities (e.g., \(B_{k_1,k_2}\)).

The second limitation of the SBM-graphon is that it determines the intensity value for a pair of nodes by considering only the block (\(\Box _{k_1,k_2}\)) in which \((u_i, u_j)\) resides. However, a node's relations with other nodes, especially neighbouring nodes in adjacent blocks, might also be expected to have a certain influence on the generation of the target relation, if one considers the relational data collectively. As a result, it could be beneficial to consider the interactions that naturally exist among all blocks when generating the relation \(R_{ij}\).

The third limitation of the SBM-graphon concerns the latent node-clustering information it provides as a by-product through the hidden labels \(\{z_i^{(1)}, z_i^{(2)}\}_{i=1}^n\). This clustering information might not be ideal because each node is assigned to only one cluster in each dimension. That is, when considering the outgoing relations from node i, it is assumed that node i consistently plays one single role in any relation with other nodes. In practice, however, a node might play different roles by participating in different relations with different nodes. As a result, it would be more useful and flexible to allow a node to belong to multiple clusters in each dimension.

To address the limitations of piecewise-constant graphons (and in particular, the SBM-graphon), we propose a smoothing strategy to enable piecewise-constant graphons to produce continuous intensity values. The proposed smoothing graphons naturally consider interactions between the partitions and allow each node to play multiple roles in different relations.

3 Main models

3.1 The integrated smoothing graphon (ISG)

In order to improve on the limitations of the piecewise-constant graphon, we first develop the Integrated Smoothing Graphon (ISG), based on the SBM-graphon construction. The piecewise-constant nature of the SBM-graphon is created through the use of an indicator function in (2) that selects only the particular block accommodating the target node pair. Accordingly, we replace the indicator function with an alternative that can produce continuous intensity values. Moreover, to capture the interaction between all blocks, we construct the smoothing graphon function to generate the intensity value as a summation over all block intensities, weighted by the importance of each block. Let \(F_{\Box _{k_1,k_2}}(u_i^{(1)}, u_j^{(2)})\) be the weight of the block \(\Box _{k_1,k_2}\) with respect to \((u_i^{(1)}, u_j^{(2)})\in [0, 1]^2\). The mixture intensity \(g\left( u_i^{(1)}, u_j^{(2)}\right)\), used to determine \(R_{ij}\), can then be represented as

$$\begin{aligned} g\left( u_i^{(1)}, u_j^{(2)}\right) = \sum _{k_1, k_2}F_{\Box _{k_1,k_2}}(u^{(1)}_i, u^{(2)}_j)\cdot B_{k_1, k_2}, \end{aligned}$$
(3)

where \(\sum _{k_1,k_2}F_{\Box _{k_1,k_2}}(u^{(1)}_i, u^{(2)}_j)=1\).

The ISG generative process can be summarised as:

1)\(\sim\)3) The block intensities (\(\pmb B\)), graphon partition (\(\boxplus\)) and two-dimensional coordinates (\(\{u_{i}^{(1)}, u_{i}^{(2)}\}_{i=1}^n\)) are generated as for the SBM-graphon;

  4. For \(i, j=1, \ldots , n\):

     (a) Calculate the mixture intensity \(g\left( u_i^{(1)}, u_j^{(2)}\right)\) according to (3) for the node coordinates \((u_i^{(1)}, u_j^{(2)})\);

     (b) Generate \(R_{ij}\sim \text{ Bernoulli }\left( g\left( u_i^{(1)}, u_j^{(2)}\right) \right)\).

As a consequence, while the SBM-graphon determines the relation intensity based only on the single block in which \((u_i^{(1)}, u_j^{(2)})\) resides, the ISG computes a mixture intensity as a weighted (and normalised) sum of all block intensities. That is, instead of assigning a weight of 1 for one particular block and weights of 0 for all other blocks, the ISG weights the importance of each block with respect to the pair of node coordinates \((u_i^{(1)}, u_j^{(2)})\). As long as the weighting function \(F_{\Box _{k_1,k_2}}(u_i^{(1)}, u_j^{(2)})\) is continuous, it follows that the mixture intensity (3) is also continuous. The intensity function (3) then becomes a smoothing graphon function.

The ISG allows the mixture intensity to take any value between the minimum and maximum of all block intensities. As a result, the ISG provides more modelling flexibility compared to the SBM-graphon, where only limited discrete intensity values (equivalent to the number of blocks) are available to describe relations.

3.2 Construction of the mixture intensity

To ensure that the graphon function (3) is continuous, we consider an integral-based weighting function of the form

$$\begin{aligned} {F}_{\Box _{k_1,k_2}}(u_i^{(1)}, u_j^{(2)}) \propto \int _{L^{(1)}_{k_1-1}}^{L^{(1)}_{k_1}}f(x-u_i^{(1)})dx \cdot {\int _{L^{(2)}_{k_2-1}}^{L^{(2)}_{k_2}}f(x-u_j^{(2)})dx} \end{aligned}$$
(4)

where \(f(x-u)\) is a univariate derivative function. Beyond the continuity requirement, \(f(x-u)\) and \({F}_{\Box _{k_1,k_2}}(u_i^{(1)}, u_j^{(2)})\) should satisfy the following three conditions:

  1. \(f(x-u)\) is non-negative;

  2. \(f(x-u)\) increases with decreasing distance (i.e. \(|x-u|\)) between x and the corresponding coordinate u. This condition means that the closer the block \(\Box _{k_1,k_2}\) is to the pair of node coordinates \((u_i^{(1)}, u_j^{(2)})\), the larger the weight that will be assigned to the block. The maximum weight value is achieved when \(|x-u_i^{(1)}|=0\) and \(|x-u_j^{(2)}|=0\);

  3. The total weight of all blocks remains invariant regardless of different partitionings of the unit space. That is, \({F}_{\Box _{k_1,k_2}}(u_i^{(1)}, u_j^{(2)})={F}_{\Box '_{k_1,k_2}}(u_i^{(1)}, u_j^{(2)})+{F}_{\Box ''_{k_1,k_2}}(u_i^{(1)}, u_j^{(2)})\), where \(\Box '_{k_1,k_2}, \Box ''_{k_1,k_2}\) are sub-boxes of \(\Box _{k_1,k_2}\) such that \(\Box _{k_1,k_2}=\Box '_{k_1,k_2}\cup \Box ''_{k_1,k_2}\) and \(\Box '_{k_1,k_2}\cap \Box ''_{k_1,k_2}=\emptyset\).

It is also expected that \({F}_{\Box _{k_1,k_2}}(u_i^{(1)}, u_j^{(2)})\) can be normalised over all the \(K^2\) community pairs as \(\sum _{k_1,k_2}{F}_{\Box _{k_1,k_2}}(u_i^{(1)}, u_j^{(2)})=1\).

Fig. 2

Top row: the influence of the \(\lambda\) parameter on the Laplace probability density function with coordinate located at \(u=0.3\) (left), and the corresponding mixture intensities for \(\{u_i\}_{i=1}^n\) (right). Different colors represent different settings of \(\lambda\). The gray dotted lines represent segment division in one dimension with \(\pmb {\theta }=(0.15, 0.27, 0.08, 0.5)^\top \sim \text {Dirichlet}(1, 1, 1,1)\). Bottom row: visualizations of the Integrated Smoothing Graphon under the Stochastic Block Model for different values of \(\lambda\). Darker shading represents higher graphon intensity

There are many candidate functions satisfying these conditions, such as Gaussian or Laplace probability density functions. For ease of computation and convenience of integration, we use the scaled Laplace density (with location parameter \(\mu = 0\)) as the derivative function \(f_{\lambda }(x-u)\). We leave other function choices for future work. In particular, we define \(f_{\lambda }(x-u)=\frac{{{\lambda }}}{2}\frac{e^{{-{{\lambda }} |x-u|}}}{G_{\lambda }(1-u)-G_{\lambda }(-u)}\), where \(G_{\lambda }(x-u)=\left\{ \begin{array}{lr} \frac{1}{2}e^{{{\lambda }} (x-u)}; &{} (x-u)<0 \\ 1-\frac{1}{2}e^{-{{\lambda }} (x-u)}; &{} (x-u)\ge 0 \end{array}\right.\). We then have \(\int _{L_{k_1-1}}^{L_{k_1}}f_{\lambda }(x-u)dx=\frac{G_{\lambda }(L_{k_1}-u)-G_{\lambda }(L_{k_1-1}-u)}{G_{\lambda }(1-u)-G_{\lambda }(-u)}\). As a result, for relation \(R_{ij}\) and corresponding node coordinates \((u^{(1)}_i, u^{(2)}_j)\), the normalised weight \(F_{\Box _{k_1,k_2}}(u^{(1)}_i, u^{(2)}_j)\) of the \((k_1, k_2)\)th block \(\Box _{k_1,k_2}\) contributing to the mixture intensity of \(R_{ij}\) is given by

$$\begin{aligned} F_{\Box _{k_1,k_2}}(u^{(1)}_i, u^{(2)}_j) =&\frac{G_{\lambda }(L^{(1)}_{k_1}-u^{(1)}_i)-G_{\lambda }(L_{k_1-1}^{(1)}-u^{(1)}_i)}{G_{\lambda }(1-u^{(1)}_i)-G_{\lambda }(-u^{(1)}_i)}\nonumber \\&\times \frac{G_{\lambda }(L^{(2)}_{k_2}-u^{(2)}_j)-G_{\lambda }(L_{k_2-1}^{(2)}-u^{(2)}_j)}{G_{\lambda }(1-u^{(2)}_j)-G_{\lambda }(-u^{(2)}_j)}. \end{aligned}$$
(5)

Proposition 1

\(\sum _{k_1,k_2}F_{\Box _{k_1,k_2}}(u^{(1)}_i, u^{(2)}_j)=1\).

Proof

   

$$\begin{aligned}&\sum _{k_1,k_2}F_{\Box _{k_1,k_2}}(u^{(1)}_i, u^{(2)}_j) \nonumber \\&\quad =\left[ \sum _{k_1}\frac{G_{\lambda }(L^{(1)}_{k_1}-u^{(1)}_i)-G_{\lambda }(L_{k_1-1}^{(1)}-u^{(1)}_i)}{G_{\lambda }(1-u^{(1)}_i)-G_{\lambda }(-u^{(1)}_i)}\right] \nonumber \\&\qquad \cdot \left[ \sum _{k_2}\frac{G_{\lambda }(L^{(2)}_{k_2}-u^{(2)}_j)-G_{\lambda }(L_{k_2-1}^{(2)}-u^{(2)}_j)}{G_{\lambda }(1-u^{(2)}_j)-G_{\lambda }(-u^{(2)}_j)}\right] =1. \end{aligned}$$
(6)

\(\square\)

Figure 2 (left) illustrates the function curves of \(f_{\lambda }(x-u)\) for \(u=0.3\), and Fig. 2 (right) shows the resulting one-dimensional mixture of intensities under varying scale parameter values \(\lambda =0.25, 25\) and 250. It is easily observed that, when \(\lambda\) is smaller, both the curve of the derivative function and the mixture intensity become flatter and smoother. Conversely, for larger \(\lambda\), the mixture intensity values (generated for the coordinate 0.3) become more discrete. The bottom row of Fig. 2 visualises the mixture intensities obtained by applying the ISG to the SBM under the same three \(\lambda\) values.
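The following sketch (our own function names; an illustrative partition and block intensities) computes the normalised weights of Eq. (5) and the mixture intensity of Eq. (3), and numerically checks Proposition 1 for the three \(\lambda\) values used in Fig. 2:

```python
import numpy as np

def G(x, lam):
    """Scaled Laplace CDF G_lambda (location 0) from the text."""
    return np.where(x < 0, 0.5 * np.exp(lam * x), 1.0 - 0.5 * np.exp(-lam * x))

def segment_weights(u, L, lam):
    """Normalised weights F_k(u) of all segments [L_{k-1}, L_k] for coordinate u."""
    num = G(L[1:] - u, lam) - G(L[:-1] - u, lam)
    return num / (G(1.0 - u, lam) - G(-u, lam))

# Illustrative partition and block intensities (not the paper's values).
rng = np.random.default_rng(2)
K = 4
L1 = np.concatenate(([0.0], np.cumsum(rng.dirichlet(np.ones(K)))))
L2 = np.concatenate(([0.0], np.cumsum(rng.dirichlet(np.ones(K)))))
B = rng.beta(1.0, 1.0, size=(K, K))

def isg_intensity(u1, u2, lam):
    """Mixture intensity g(u1, u2) = sum_{k1,k2} F_box(u1, u2) * B[k1, k2] (Eq. 3)."""
    w = np.outer(segment_weights(u1, L1, lam), segment_weights(u2, L2, lam))
    assert abs(w.sum() - 1.0) < 1e-9     # Proposition 1: the weights sum to one
    return float((w * B).sum())

for lam in (0.25, 25.0, 250.0):          # smoother -> sharper, as in Fig. 2
    print(lam, isg_intensity(0.3, 0.8, lam))
```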

Proposition 2

\(\lambda\) controls the smoothness of the graphon, with \(\lambda \rightarrow \infty\) recovering the piecewise-constant graphon, and \(\lambda \rightarrow 0\) resulting in a globally constant graphon.

Proof

Using L’Hôpital’s rule, when \(\lambda \rightarrow 0\), we have

$$\begin{aligned}&\lim _{\lambda \rightarrow 0} \frac{G_{\lambda }(L^{(1)}_{k_1}-u^{(1)}_i)-G_{\lambda }(L_{k_1-1}^{(1)}-u^{(1)}_i)}{G_{\lambda }(1-u^{(1)}_i)-G_{\lambda }(-u^{(1)}_i)}\nonumber \\&\quad =\frac{L_{k_1}^{(1)}-u_i^{(1)}-(L_{{k_1}-1}^{(1)}-u_i^{(1)})}{1-u_i^{(1)}+u_i^{(1)}} = L_{k_1}^{(1)}-L_{{k_1}-1}^{(1)}. \end{aligned}$$
(7)

Thus, we get \(F_{\Box _{k_1,k_2}}(u^{(1)}_i, u^{(2)}_j) = (L_{k_1}^{(1)}-L_{k_1-1}^{(1)})(L_{k_2}^{(2)}-L_{k_2-1}^{(2)})\), which is unrelated to the coordinate of \((u^{(1)}_i, u^{(2)}_j)\). The graphon is a globally constant graphon.

We have three different cases when \(\lambda \rightarrow \infty\): case (1), \(L_{k_1}^{(1)}>L_{k_1-1}^{(1)}>u_i^{(1)}\), we have

$$\begin{aligned}&\lim _{\lambda \rightarrow \infty } \frac{G_{\lambda }(L^{(1)}_{k_1}-u^{(1)}_i)-G_{\lambda }(L_{k_1-1}^{(1)}-u^{(1)}_i)}{G_{\lambda }(1-u^{(1)}_i)-G_{\lambda }(-u^{(1)}_i)}\\&\quad = \lim _{\lambda \rightarrow \infty } \frac{1-\frac{1}{2}e^{-\lambda (L^{(1)}_{k_1}-u^{(1)}_i)}-(1-\frac{1}{2}e^{-\lambda (L^{(1)}_{k_1-1}-u^{(1)}_i)})}{1-\frac{1}{2}e^{-\lambda (1-u^{(1)}_i)}-\frac{1}{2}e^{-\lambda (u^{(1)}_i)}}=0; \end{aligned}$$

case (2), \(L_{k_1-1}^{(1)}<L_{k_1}^{(1)}<u_i^{(1)}\), we have

$$\begin{aligned}&\lim _{\lambda \rightarrow \infty } \frac{G_{\lambda }(L^{(1)}_{k_1}-u^{(1)}_i)-G_{\lambda }(L_{k_1-1}^{(1)}-u^{(1)}_i)}{G_{\lambda }(1-u^{(1)}_i)-G_{\lambda }(-u^{(1)}_i)}\\&\quad = \lim _{\lambda \rightarrow \infty } \frac{\frac{1}{2}e^{\lambda (L^{(1)}_{k_1}-u^{(1)}_i)}-(\frac{1}{2}e^{\lambda (L^{(1)}_{k_1-1}-u^{(1)}_i)})}{1-\frac{1}{2}e^{-\lambda (1-u^{(1)}_i)}-\frac{1}{2}e^{-\lambda (u^{(1)}_i)}}=0; \end{aligned}$$

case (3), \(L_{k_1-1}^{(1)}<u_i^{(1)}<L_{k_1}^{(1)}\), we have

$$\begin{aligned}&\lim _{\lambda \rightarrow \infty } \frac{G_{\lambda }(L^{(1)}_{k_1}-u^{(1)}_i)-G_{\lambda }(L_{k_1-1}^{(1)}-u^{(1)}_i)}{G_{\lambda }(1-u^{(1)}_i)-G_{\lambda }(-u^{(1)}_i)}\\ =&\lim _{\lambda \rightarrow \infty } \frac{1-\frac{1}{2}e^{-\lambda (L^{(1)}_{k_1}-u^{(1)}_i)}-(\frac{1}{2}e^{\lambda (L^{(1)}_{k_1-1}-u^{(1)}_i)})}{1-\frac{1}{2}e^{-\lambda (1-u^{(1)}_i)}-\frac{1}{2}e^{-\lambda (u^{(1)}_i)}}=1. \end{aligned}$$

That is, \(F_{\Box _{k_1,k_2}}(u^{(1)}_i, u^{(2)}_j)=1\) if and only if the coordinate \((u^{(1)}_i, u^{(2)}_j)\) lies in the \((k_1, k_2)\)th block. Thus, this smoothing graphon only becomes piecewise-constant when \(\lambda \rightarrow \infty\). \(\square\)

Accordingly, we refer to \(\lambda\) as the smoothing parameter.

3.3 Latent feature smoothing graphon (LFSG) with probabilistic assignment

While the ISG addresses the limitations of the SBM-graphon by generating continuous intensity values, its graphon function (3) indicates that all blocks are involved in calculating the mixture intensity for generating individual relations. Accordingly, the additive form for evaluating the mixture intensity makes it difficult to form efficient inference schemes for all random variables. To improve inferential efficiency we introduce auxiliary pairwise latent labels \(\{s_{ij}\}_{j=1}^n\) (associated with node i) and \(\{r_{ij}\}_{i=1}^n\) (associated with node j) for individual relations \(\{R_{ij}\}_{i,j=1}^n\), where \(s_{ij},r_{ij} \in \{1, \ldots , K\}\) are the sender and receiver effects respectively. The \(\{s_{ij}\}_{j=1}^n\) and \(\{r_{ij}\}_{i=1}^n\) are sampled from the respective node categorical distributions in their corresponding dimensions using normalised weights as probabilities. In particular

$$\begin{aligned} \{s_{ij}\}_{j=1}^n&\sim \text {Categorical}(F^{(1)}_{1}(u^{(1)}_i), \ldots , F^{(1)}_{K}(u^{(1)}_i)) \nonumber \\ \{r_{ij}\}_{i=1}^n&\sim \text {Categorical}(F^{(2)}_{1}(u^{(2)}_j), \ldots , F^{(2)}_{K}(u^{(2)}_j)), \end{aligned}$$
(8)

where \(F_{k}(u)=\frac{G_{\lambda }(L_{k}-u)-G_{\lambda }(L_{k-1}-u)}{G_{\lambda }(1-u)-G_{\lambda }(-u)}\) is the normalised weight of segment k in the dimension of coordinate u. For each relation from node i to node j (\(R_{ij}\)), the hidden label \(s_{ij}\) denotes the group that node i belongs to (in the 1st dimension) and \(r_{ij}\) denotes the group that node j belongs to (in the 2nd dimension). Through the introduction of the two labels, the final intensity in determining \(R_{ij}\) can be obtained similarly to the Mixed Membership Stochastic Block Model (MMSB) (Airoldi et al., 2009):

$$\begin{aligned} P(R_{ij}=1|s_{ij}, r_{ij}, \pmb B)=B_{s_{ij}, r_{ij}}. \end{aligned}$$
(9)

Note that since both \(\{s_{ij}\}_{j=1}^n\) and \(\{r_{ij}\}_{i=1}^n\) are n-element arrays, each node has the potential to belong to multiple segments, rather than the single segment under the SBM-graphon. When participating in different relations, each outgoing node i (incoming node j) can fall into different segments, which means that each node can play different roles when taking part in different relations. Note also that taking expectations over the hidden labels \(s_{ij}\) and \(r_{ij}\) results in the same intensity as for the ISG, so that

$$\begin{aligned} {\mathbb {E}}_{s_{ij}, r_{ij}}\left[ P(R_{ij}=1\vert s_{ij}, r_{ij}, \pmb B)\right] = g\left( u_i^{(1)}, u_j^{(2)}\right) . \end{aligned}$$
(10)

We term this approach the Latent Feature Smoothing Graphon (LFSG). Its generative process is described as follows:

1)\(\sim\)3) The block intensities (\(\pmb B\)), graphon partition (\(\boxplus\)) and 2-dimensional coordinates (\(\{u_{i}^{(1)}, u_{i}^{(2)}\}_{i=1}^n\)) are generated as for the SBM-graphon;

  4. For \(i=1, \ldots , n\), calculate the hidden label distributions in each dimension, \({\pmb {F}}^{(1)}(u_i^{(1)})\) and \({\pmb {F}}^{(2)}(u_i^{(2)})\), where \({\pmb {F}}^{(1)}(u_i^{(1)})=({F}^{(1)}_1(u_i^{(1)}), \ldots , {F}^{(1)}_K(u_i^{(1)}))\);

  5. For \(i, j=1, \ldots , n\):

     (a) Generate the hidden labels \(s_{ij}\sim {\pmb {F}}^{(1)}(u_i^{(1)}), r_{ij}\sim {\pmb {F}}^{(2)}(u_j^{(2)})\) following (8);

     (b) Generate \(R_{ij}\sim \text {Bernoulli}\left( B_{s_{ij}, r_{ij}}\right)\).
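Assuming the same conventions as the earlier sketches (zero-based labels, illustrative sizes and hyper-parameters), the LFSG generative process might be coded as below; note that s and r are \(n\times n\) arrays, so a node can take different groups in different relations:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, lam = 50, 4, 25.0                      # illustrative sizes and smoothing

def G(x):                                    # Laplace CDF G_lambda (location 0)
    return np.where(x < 0, 0.5 * np.exp(lam * x), 1.0 - 0.5 * np.exp(-lam * x))

def weights(u, L):                           # normalised segment weights F_k(u)
    w = (G(L[1:] - u) - G(L[:-1] - u)) / (G(1.0 - u) - G(-u))
    return w / w.sum()                       # guard against floating-point drift

# Steps 1-3: as for the SBM-graphon.
B = rng.beta(1.0, 1.0, size=(K, K))
L1 = np.concatenate(([0.0], np.cumsum(rng.dirichlet(np.ones(K)))))
L2 = np.concatenate(([0.0], np.cumsum(rng.dirichlet(np.ones(K)))))
u1, u2 = rng.uniform(size=n), rng.uniform(size=n)

# Step 4: per-node label distributions F^(1)(u_i) and F^(2)(u_j).
F1 = np.stack([weights(u, L1) for u in u1])  # n x K
F2 = np.stack([weights(u, L2) for u in u2])  # n x K

# Step 5: pairwise labels s_ij ~ F^(1)(u_i), r_ij ~ F^(2)(u_j), then the relations.
s = np.stack([rng.choice(K, size=n, p=F1[i]) for i in range(n)])          # s[i, j]
r = np.stack([rng.choice(K, size=n, p=F2[j]) for j in range(n)], axis=1)  # r[i, j]
R = rng.binomial(1, B[s, r])
```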

Within the LFSG, \(\lambda\) can provide additional insight into the latent structures, in that it indicates the extent to which nodes belong to multiple communities. Larger (smaller) values of \(\lambda\) indicate that nodes are likely to belong to fewer (more) communities. The number of communities for the LFSG models can be determined using strategies similar to those for the SBM and the ISG model.

Comparing the LFSG and the MMSB The MMSB model is another notable Bayesian method for modelling exchangeable relational data. In contrast to other graphon methods, the MMSB model allows each node i to have a group distribution \(\pmb {F}_i\), which follows a Dirichlet distribution. To form the relation between any two nodes ij, a latent label pair consisting of a sender and a receiver \((s_{ij}, r_{ij})\) is first generated via \(s_{ij}\sim \text {Categorical}(\pmb {F}_i)\), and \(r_{ij}\sim \text {Categorical}(\pmb {F}_j)\). The relation \(R_{ij}\) can then be generated based on the intensity of the block \(\pmb {B}\) formed by group \(s_{ij}\) and group \(r_{ij}\): \(R_{ij}\sim \text {Bernoulli}(B_{s_{ij},r_{ij}})\). Our proposed LFSG model shares similarities with the MMSB model, since both of them use group distributions to represent individual nodes and the likelihood generation method is the same. However, there are key differences. These are: (1) The priors for the group distributions are different. In the MMSB model, the group distributions of all nodes are generated independently from a Dirichlet distribution, whereas in the LFSG model, the group distributions are highly dependent, since all are determined by the same partition structure and node coordinates (see Eq. 5); (2) The MMSB model requires nK parameters to form the group distributions, while the LFSG model requires only \(2(n+K)\) parameters.

The LFSG model naturally fits within the graphon framework. The MMSB model can also be made to fit within the graphon framework in two ways. Firstly, by considering the group distributions \(\pmb {\pi }_i\in [0, 1]^K\) in the K-dimensional hypercube instead of the unit interval. Secondly, noting that the minimal condition on the function F under general graphon theory is that F is measurable, applying a transformation from [0, 1] to \([0,1]^K\) means that the MMSB model can also fit within the graphon framework on the unit square.

The graphical models for implementing the ISG within the SBM (referred to as the ISG-SBM), as well as for implementing the LFSG within the SBM (referred to as the LFSG-SBM) are illustrated in Fig. 3. The main difference between the two models – the introduction of the pairwise hidden labels \(s_{ij}\) and \(r_{ij}\) for generating each relation \(R_{ij}\) – allows the LFSG-SBM to enjoy the following advantages over the ISG-SBM:

  • The aggregated counting information of the hidden labels enables efficient Gibbs sampling of the block intensities \(\pmb {B}\). In the ISG (or SBM), the block intensity \(\pmb {B}\) is inferred through each node's latent label, while in the LFSG (or MMSB) \(\pmb {B}\) is inferred through the senders and receivers of each relation. Since the number of senders and receivers is larger than the number of node latent labels, i.e. \(n^2>n\), the inference on \(\pmb {B}\) under the LFSG (or MMSB) is likely better than under the ISG (or SBM).

  • Calculation involving \(F_{\Box }\) is instead reduced to calculation involving \(F_{k}(u)\), avoiding the inclusion of all blocks when calculating the mixture intensity.

  • Because each node is allowed to belong to different groups when linking to other nodes, permitting differences in the natures of those links, the group distribution F(u) is then easily interpretable as the group membership distribution for that node. For example, a higher membership degree in group k indicates the node is more important or active in group k.

Model identifiability In a similar manner to the MMSB model, the LFSG model also mitigates issues of identifiability by relating the number of blocks to the low-rank property of the edge probability matrix. In this respect, the LFSG model can be regarded as a “restricted” version of the MMSB model (see above), in that while the group distributions in the LFSG are highly dependent, those in the MMSB model are independently generated. In our simulations we did not encounter any parameter non-identifiability. However, we note that the number of parameters in the LFSG model is smaller than that of the MMSB model (i.e. \(2(n+K)<nK\) for \(K\ge 3\) and a moderate value of n), and so the LFSG model may perform better than the MMSB model in overcoming issues of non-identifiability.

Fig. 3

The graphical model for a the ISG-SBM and b the LFSG-SBM. (a) The weights of the blocks \(F_{\Box }\) are first calculated using the partition \(\boxplus\), node coordinates U and smoothing parameter \(\lambda\). Then \(F_{\Box }\) and the block intensities \(\pmb B\) are integrated together to generate the exchangeable relations R. (b) The weight F for each node is individually generated using the partition \(\boxplus\), node coordinates U and the smoothing parameter \(\lambda\), and auxiliary hidden labels s, r are drawn for each node pair relationship. Then s, r are used, together with the block intensities \(\pmb B\), to generate the exchangeable relations, R

3.4 Extensions to other piecewise-constant graphons

The major difference between the constructions of the existing piecewise-constant graphons lies in the process generating the partitions (\(\boxplus\); Fig. 1). As a result, our smoothing approach, while described for the SBM-graphon, can be straightforwardly applied to other piecewise-constant graphons. For example, to apply the ISG to other piecewise-constant graphons, we can similarly calculate a mixture intensity as a weighted sum of the intensities of all existing blocks. When the partitioned blocks are rectangular (as for e.g. the MP-RM, RTP-RM and RBP-RM), the weight for each block can be computed by independently integrating the derivative function over the two dimensions. If the partitioned blocks are convex polygons (as for e.g. the BSP-RM; Fan et al., 2018), the weight can be obtained by integrating the derivative function over the polygon.

Nonparametric methods for Stochastic Block Models In addition to the above Bayesian models, there are a number of nonparametric approaches for implementing stochastic block models. These approaches differ in terms of statistical accuracy and computational complexity, and include likelihood-based methods (Amini et al., 2013; Bickel and Chen, 2009; Celisse et al., 2012; Choi et al., 2012; Zhao et al., 2012), moment-based methods (Anandkumar et al., 2013), convex optimization methods (Chen et al., 2012), and spectral clustering methods (Balakrishnan et al., 2011; Fishkind et al., 2013; Rohe et al., 2011; Sarkar and Bickel, 2015). These approaches typically aim to produce consistent parameter point estimators, rather than full posterior distributions on the model parameters as considered here.

4 Inference


We present a Markov Chain Monte Carlo (MCMC) algorithm for posterior model inference, with detailed steps for the ISG and the LFSG models as illustrated in Algorithms 1 and 2 respectively. In general, the joint distribution over the hidden labels \(\{s_{ij}, r_{ij}\}_{i,j=1}^n\), pairwise node coordinates \(\{u_{i}^{(1)}, u_{i}^{(2)}\}_{i=1}^n\), group distributions \(\pmb {\theta }^{(1)}, \pmb {\theta }^{(2)}\), the block intensities \(\pmb {B}\) and the smoothing parameter \(\lambda\) is:

$$\begin{aligned}&P(\{s_{ij}, r_{ij}, R_{ij}\}_{i,j=1}^n, \{u_{i}^{(1)}, u_{i}^{(2)}\}_{{i=1}}^n, \pmb {\theta }^{(1)}, \pmb {\theta }^{(2)}, \pmb {B}, \lambda |\alpha _0, \beta _0)\nonumber \\&\quad \propto \prod _{i,k}\left[ F_{k}^{(1)}(u_i^{(1)}|\pmb {\theta }^{(1)}, \lambda )^{m_{ik}^{(1)}}F_{k}^{(2)}(u_i^{(2)}|\pmb {\theta }^{(2)}, \lambda )^{m_{ik}^{(2)}}\right] \nonumber \\&\qquad \cdot \prod _{k_1, k_2}\left[ B_{k_1, k_2}^{N_{k_1, k_2}^{(1)}+\alpha _0-1}(1-B_{k_1, k_2})^{N_{k_1, k_2}^{(0)}+\beta _0-1}\right] \nonumber \\&\qquad \cdot \prod _{k}\left[ (\theta _{k}^{(1)})^{\alpha _{k}-1}(\theta _{k}^{(2)})^{\alpha _{k}-1}\right] \cdot P(\lambda ), \end{aligned}$$
(11)

where \(m_{ik}^{(1)}=\sum _{j=1}^n\pmb {1}(s_{ij}=k), m_{ik}^{(2)}=\sum _{j=1}^n\pmb {1}(r_{ji}=k), N_{k_1, k_2}^{(1)}=\sum _{(i,j):s_{ij}=k_1, r_{ij}=k_2}\pmb {1}(R_{ij}=1), N_{k_1, k_2}^{(0)}=\sum _{(i,j):s_{ij}=k_1, r_{ij}=k_2}\pmb {1}(R_{ij}=0)\). In this joint distribution, we have set the following prior distributions for the variables: \(s_{ij}\sim \text {Categorical}({\pmb {F}}(u_i^{(1)}|\pmb {\theta }^{(1)}, \lambda ))\), \(B_{k_1,k_2}\sim \text {Beta}(\alpha _0, \beta _0)\), \(\pmb {\theta }^{(1)}\sim \text {Dirichlet}(\pmb {\alpha }_{1\times K})\). We let \(\lambda \sim \text {Gamma}(0.1, 0.1)\) follow a vague Gamma distribution, where \(\text {Gamma}(a, b)\) is a Gamma distribution with mean a/b and variance \(a/b^2\).

The details for updating each parameter in the ISG and LFSG MCMC algorithms are listed below.

Updating \(\{u_{i}^{(1)}, u_{i}^{(2)}\}_{i=1}^n\) : Independent Metropolis-Hastings steps can be used to update the variables \(u_{i}^{(1)}, u_{i}^{(2)}\). We propose a new sample for \(u_{i}^{(1)}\) from \(u^*\sim \text {Beta}(\alpha _u, \beta _u)\), and accept this proposal with probability \(\min (1, \alpha _{u_i^{(1)}})\) where

$$\begin{aligned} \alpha _{u_i^{(1)}} = \frac{\text {Be}(u_i^{(1)}|\alpha _u,\beta _u)}{\text {Be}(u^*|\alpha _u,\beta _u)} \prod _{k}\frac{F_{k}^{(1)}(u^{*}|\pmb {\theta }^{(1)}, \lambda )^{m_{ik}^{(1)}}}{F^{(1)}_{k}(u_i^{(1)}|\pmb {\theta }^{(1)}, \lambda )^{m_{ik}^{(1)}}}, \end{aligned}$$
(12)

where \(\text {Be}(u|\alpha ,\beta )\) denotes the Beta density with parameters \(\alpha\) and \(\beta\) evaluated at u. The update for \(u_i^{(2)}\) proceeds likewise. Note that each of the 2n parameters \(\{u_i^{(1)},u_i^{(2)}\}_{i=1}^n\) can be updated in parallel. In our simulations we found that \(\alpha _u=\beta _u=1\) gave good sampler performance.
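A sketch of this independent Metropolis-Hastings step for a single coordinate (our own helper names; the label counts \(m_{ik}\) are assumed precomputed) is:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def update_u(u, m, L, lam, rng, a_u=1.0, b_u=1.0):
    """One independent MH step for a coordinate (Eq. 12).

    u : current coordinate in [0, 1];  m : length-K array of label counts
    m_{ik} for this node;  L : segment boundaries L_0, ..., L_K.
    """
    def G(x):  # Laplace CDF G_lambda
        return np.where(x < 0, 0.5 * np.exp(lam * x), 1.0 - 0.5 * np.exp(-lam * x))

    def log_lik(v):  # sum_k m_k * log F_k(v); skip zero counts for stability
        w = (G(L[1:] - v) - G(L[:-1] - v)) / (G(1.0 - v) - G(-v))
        return np.sum(m[m > 0] * np.log(w[m > 0]))

    u_star = rng.beta(a_u, b_u)
    log_acc = (log_lik(u_star) - log_lik(u)
               + beta_dist.logpdf(u, a_u, b_u) - beta_dist.logpdf(u_star, a_u, b_u))
    return u_star if np.log(rng.uniform()) < log_acc else u
```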

Updating \(\pmb {\theta }^{(1)}, \pmb {\theta }^{(2)}\): A Metropolis-Hastings step can be used to update \(\pmb {\theta }^{(1)}\) and \(\pmb {\theta }^{(2)}\). For \(\pmb {\theta }^{(1)}\) or \(\pmb {\theta }^{(2)}\) we draw a proposed sample \(\pmb {\theta }^*\sim \text {Dirichlet}(\pmb {\alpha }_{1\times K})\) from a Dirichlet distribution with concentration parameters \(\pmb {\alpha }_{1\times K}\). We accept the proposal \(\pmb {\theta }^*\) for w.l.o.g. \(\pmb {\theta }^{(1)}\) with probability \(\min (1, \alpha _{\pmb {\theta }^{(1)}})\), where

$$\begin{aligned} \alpha _{\pmb {\theta }^{(1)}} =\frac{\text {Diri}(\pmb {\theta }^*|\pmb {\alpha })}{\text {Diri}(\pmb {\theta }^{(1)}|\pmb {\alpha })}\cdot \prod _{i,k}\frac{F^{(1)}_{k}(u_i^{(1)}|\pmb {\theta }^{*}, \lambda )^{m_{ik}^{(1)}}}{F^{(1)}_{k}(u_i^{(1)}|\pmb {\theta }^{(1)}, \lambda )^{m_{ik}^{(1)}}}, \end{aligned}$$
(13)

where \(\text {Diri}(\pmb {\theta }|\pmb {\alpha })\) denotes the Dirichlet density with concentration parameter \(\pmb {\alpha }\) evaluated at \(\pmb {\theta }\). A similar update can be implemented for \(\pmb {\theta }^{(2)}\). Both \(\pmb {\theta }^{(1)}\) and \(\pmb {\theta }^{(2)}\) can be updated in parallel.

Updating \(\pmb B\): The conjugacy between the prior and the conditional likelihood for \(\pmb B\) means that we can update \(\pmb B\) via a Gibbs sampling step. Specifically, each entry \(B_{k_1, k_2}\) can be updated in parallel via

$$\begin{aligned} B_{k_1, k_2}\sim \text {Beta}(\alpha _0+N_{k_1, k_2}^{(1)}, \beta _0+N_{k_1, k_2}^{(0)}), \forall k_1, k_2. \end{aligned}$$
(14)
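Exploiting this conjugacy, all \(K^2\) entries can be refreshed in one vectorised draw; a minimal sketch, assuming s, r and R are stored as \(n\times n\) arrays, is:

```python
import numpy as np

def update_B(s, r, R, K, alpha0, beta0, rng):
    """Gibbs update of all block intensities B (Eq. 14)."""
    N1 = np.zeros((K, K))                          # N^{(1)}: positive links per (k1, k2)
    N0 = np.zeros((K, K))                          # N^{(0)}: zero links per (k1, k2)
    np.add.at(N1, (s[R == 1], r[R == 1]), 1)
    np.add.at(N0, (s[R == 0], r[R == 0]), 1)
    return rng.beta(alpha0 + N1, beta0 + N0)       # element-wise Beta draws
```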

Updating \(\{s_{ij}, r_{ij}\}_{i,j=1}^n\): The posterior distribution of \(s_{ij}\) is a categorical distribution, where the probability of \(s_{ij}=k\) is

$$\begin{aligned} P(s_{ij}=k|\theta _{k}^{(i)}, R_{ij}, B_{k,r_{ij}})\propto \, F_{k}^{(1)}(u_i^{(1)}|\pmb {\theta }^{(1)}, \lambda )\times B_{k,r_{ij}}^{R_{ij}} (1-B_{k,r_{ij}})^{1-R_{ij}}, \end{aligned}$$
(15)

and from which \(s_{ij}\) can be straightforwardly updated (\(r_{ij}\) can be updated in a similar way). Each of the \(2n^2\) labels \(\{s_{ij}, r_{ij}\}_{i,j=1}^n\) can be updated in parallel.
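A vectorised sketch of this categorical update for one row of sender labels \(\{s_{ij}\}_{j=1}^n\) (our own names; the receiver update is symmetric) is:

```python
import numpy as np

def update_s_row(F1_i, B, r_i, R_i, rng):
    """Resample {s_ij}_j for node i from Eq. (15), vectorised over j.

    F1_i : length-K weight vector F^(1)(u_i^(1));  r_i, R_i : length-n
    receiver labels r_ij and observed relations R_ij for this row.
    """
    lik = np.where(R_i[None, :] == 1, B[:, r_i], 1.0 - B[:, r_i])  # K x n
    p = F1_i[:, None] * lik
    p /= p.sum(axis=0, keepdims=True)              # normalise each column
    # one inverse-CDF categorical draw per column j
    return (rng.uniform(size=p.shape[1]) < p.cumsum(axis=0)).argmax(axis=0)
```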

Updating \(\lambda\): A Metropolis-Hastings step can be used to update \(\lambda\). We use the random walk Metropolis-Hastings algorithm to propose a new value of \(\lambda ^*\) and accept it with probability \(\min (1, \alpha _{\lambda })\), where

$$\begin{aligned} \alpha _{\lambda }=\left[ \prod _{i,k}\frac{F^{(1)}_{k}(u_i^{(1)}|\pmb {\theta }^{(1)}, \lambda ^*)^{m_{i{k}}^{(1)}}F^{(2)}_{k}(u_i^{(2)}|\pmb {\theta }^{(2)}, \lambda ^*)^{m_{i{k}}^{(2)}}}{F^{(1)}_{k}(u_i^{(1)}|\pmb {\theta }^{(1)}, \lambda )^{m_{ik}^{(1)}}F^{(2)}_{k}(u_i^{(2)}|\pmb {\theta }^{(2)}, \lambda )^{m_{ik}^{(2)}}}\right] {\cdot \frac{[\lambda ^*]^{-0.9}e^{-0.1\lambda ^*}}{[\lambda ]^{-0.9}e^{-0.1\lambda }}}. \end{aligned}$$
(16)
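A sketch of this step is given below; the Gaussian random-walk proposal and its step size are our own illustrative choices, as the text does not fix them, and the last two terms implement the \(\text {Gamma}(0.1, 0.1)\) prior ratio appearing in Eq. (16):

```python
import numpy as np

def update_lam(lam, log_weight_term, rng, step=0.1):
    """Random-walk MH for the smoothing parameter lambda (Eq. 16).

    log_weight_term(lam) is assumed to return
    sum_{i,k} m_ik^(1) log F_k^(1)(u_i | theta^(1), lam) + (same for dim 2).
    """
    lam_star = lam + step * rng.normal()           # illustrative Gaussian proposal
    if lam_star <= 0:
        return lam                                 # reject proposals outside support
    log_acc = (log_weight_term(lam_star) - log_weight_term(lam)
               - 0.9 * (np.log(lam_star) - np.log(lam))   # Gamma(0.1, 0.1) prior:
               - 0.1 * (lam_star - lam))                   # lambda^{-0.9} e^{-0.1 lambda}
    return lam_star if np.log(rng.uniform()) < log_acc else lam
```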
Table 1 Per-iteration model complexity comparison (n is the number of nodes, K is the number of communities and L is the number of positive links)

4.1 Computational complexities

Table 1 compares the per-iteration computational complexities of sampling from the ISG and the LFSG models against representative existing models, including the SBM, the MMSB and the GP-RM. The ISG algorithm requires a computational complexity of \({\mathcal {O}}(K)\) to sample each coordinate \(u_i^{(1)}\) (\(u_i^{(2)}\)) (Eq. 12), resulting in a total of \({\mathcal {O}}(nK)\) for all the coordinates; it needs a complexity of \({\mathcal {O}}(nK)\) to sample \(\pmb {\theta }^{(1)}\) (\(\pmb {\theta }^{(2)}\)) (Eq. 13); it needs a complexity of \({\mathcal {O}}(n^2K^2)\) to sample all the block intensities \(\pmb {B}\) (Eq. 14) and a complexity of \({\mathcal {O}}(nK)\) to sample \(\lambda\) (Eq. 16). In particular, it requires a computational complexity of \({\mathcal {O}}(n^2K^2)\) to calculate the intensities for generating the relations \(\{R_{ij}\}_{i,j=1}^n\), since the calculation of the mixture intensity for each relation involves a pair of coordinates (giving a total of \(n^2\)) and all of the block intensities (of which there are \(K^2\)).

The computational complexity for the LFSG is similar to that of the ISG: it needs a complexity of \({\mathcal {O}}(nK)\) to sample all coordinates, \({\mathcal {O}}(nK)\) to sample the partition structure and \({\mathcal {O}}(nK)\) to sample \(\lambda\). The LFSG has a complexity of \({\mathcal {O}}(n^2K)\) to sample all the latent labels (Eq. 15) and \({\mathcal {O}}(K^2L)\) to sample all the block intensities.

The uncoupling strategy applied in the LFSG thus lowers the \({\mathcal {O}}(n^2K^2)\) mixture-intensity cost of the ISG dramatically to \({\mathcal {O}}(K^2L)\), where L is the number of positive links (i.e. \(R_{ij}=1\)) observed in the data (Table 2 enumerates L for each data set analysed below). Note that the mixture intensity computation cost of the LFSG is the same as that of both the SBM and the MMSB. As a result, the continuous intensities of the LFSG, compared to the discrete intensities of the SBM, are achieved without sacrificing computational complexity. In contrast, the computational cost of computing the mixture intensity for the GP-RM is \({\mathcal {O}}(n^3)\) (Rasmussen, 2003), which is the highest among these methods, even though it can also provide continuous intensities. Regarding the complexity of sampling the labels, both the LFSG and the MMSB provide multiple labels for each node and incur the same cost of \({\mathcal {O}}(n^2K)\). While the SBM requires a smaller cost of \({\mathcal {O}}(nK)\) for label sampling, it only allows a single label for each node.

Fig. 4

Average area under the receiver operating characteristic curve (AUC) and precision (Precision) under the Stochastic Block Model (SBM, blue line), Mixed-membership Stochastic Block Model (MMSB, orange line), Mondrian Process-Relational Model (MP-RM, green line), Gaussian Process-Relational Model (GP-RM, red line), Latent Feature Smoothing Graphon on the SBM (LFSG, purple line) and Integrated Smoothing Graphon on the SBM (ISG, brown line) for each of the Delicious, Digg, Facebook, Flickr, Gplus and Twitter datasets, under different proportions of training data (x-axis).

Fig. 5

AUC performance of the LFSG-SBM and the MMSB under different numbers of blocks (\(K^2\), with K groups in each dimension) for the Delicious, Digg, Facebook, Flickr, Gplus and Twitter datasets

5 Experiments

We now evaluate the performance of the ISG-SBM and the LFSG-SBM on real-world data sets, comparing them with four state-of-the-art methods: the SBM, the MP-RM, the MMSB and the GP-RM. We implement posterior simulation for the SBM and the MMSB using Gibbs sampling, and for the MP-RM using a conditional Sequential Monte Carlo algorithm (Andrieu et al., 2010; Lakshminarayanan et al., 2015; Fan et al., 2019). We used \(10\,000\) iterations for each sampling algorithm, retaining the final \(5\,000\) iterations as post-burn-in draws from the posterior. Inspection of AUC and precision trace-plots indicated that \(5\,000\) iterations were more than enough to ensure sampler convergence.

Table 2 Dataset summary information (\(S(\%)\) is the sparsity of the positive links.)

5.1 Data sets

We examine six real-world exchangeable relational data sets: Delicious (Zafarani and Liu, 2009), Digg (Zafarani and Liu, 2009), Flickr (Zafarani and Liu, 2009), Gplus (Leskovec and Mcauley, 2012), Facebook (Leskovec and Mcauley, 2012), and Twitter (Leskovec and Mcauley, 2012). To construct the exchangeable relational data matrix we extract the top \(1\,000\) active nodes based on node interaction frequencies, and then randomly sample 500 nodes from these top \(1\,000\) nodes to form the \(500\times 500\) binary interaction matrix. Table 2 summarizes the number of positive links (L) and the corresponding sparsity (\(S\%\)), defined as the ratio of the number of positive links to the total number of links, for each dataset.
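A sketch of this preprocessing (assuming the raw data are available as an edge list; tie-breaking and data-handling details are our own choices) is:

```python
import numpy as np

def build_relational_matrix(edges, rng, top=1000, m=500):
    """Build an m x m binary interaction matrix from an edge list.

    edges : sequence of (sender, receiver) node ids; top and m follow Sec. 5.1.
    """
    ids, counts = np.unique(np.asarray(edges).ravel(), return_counts=True)
    active = ids[np.argsort(-counts)][:top]          # top-1000 most active nodes
    chosen = rng.choice(active, size=m, replace=False)
    index = {node: k for k, node in enumerate(chosen)}
    R = np.zeros((m, m), dtype=np.int8)
    for a, b in edges:                               # keep links among chosen nodes
        if a in index and b in index:
            R[index[a], index[b]] = 1
    return R
```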

Table 3 Performance of the neighborhood smoothing method of Zhang et al. (2017) with 90%/10% training/testing data, for each real world dataset
Fig. 6

Visualisation of mixture intensity from one posterior draw, the posterior mean of smoothing parameter \(\lambda\) and pairwise hidden labels on the Delicious, Digg, Flickr, Gplus, Facebook and Twitter datasets when implementing the Latent Feature Smoothing Graphon within the Stochastic Block Model. There are three figures for each dataset. Left: the grey level in the unit square illustrates the predicted mixture intensity for each relation (darker = higher intensity), the dotted lines indicate the related partition (\(\boxplus =\pmb {\theta }^{(1)}\times \pmb {\theta }^{(2)}\)). Right: the different colours represent the different values of the latent labels \(s_{ij}\) (right top) and \(r_{ij}\) (right bottom), with the y-axis indicating different nodes (sorted by the ratio of the labels) and the x-axis showing the proportion of different labels for each node

Fig. 7

Partition structure visualisations on the Delicious, Digg, Flickr, Gplus, Facebook and Twitter datasets when implementing the Latent Feature Smoothing Graphon within the Stochastic Block Model. Black dots refer to the observed linkages. We re-arrange the row and column indexes based on the nodes' coordinates; these visualisations reflect the observed relational data matrix after the index re-arrangements

Fig. 8

Visualisations of the coordinates of all nodes on the Delicious, Digg, Flickr, Gplus, Facebook and Twitter datasets when implementing the Latent Feature Smoothing Graphon within the Stochastic Block Model. Black dots refer to the observed linkages. We re-arrange the row and column indexes based on the nodes' coordinates; these visualisations reflect the observed relational data matrix after the index re-arrangements

5.2 Experimental setting

The hyper-parameters for each method are set as follows: for the SBM, LFSG-SBM, ISG-SBM, MMSB and MP-RM, the hyper-parameters \(\alpha _0, \beta _0\) used in generating the block intensities are set as \(\alpha _0=S\%, \beta _0=1-S\%\), where \(S\%\) refers to the sparsity shown in Table 2, so that the block intensity has an expectation equal to the sparsity of the exchangeable relational data; for the SBM, LFSG-SBM, ISG-SBM and MMSB, we set the group distribution \(\pmb {\theta }\) as \(\text {Dirichlet}(\pmb {1}_{1\times 4})\). Hence, the number of groups in each dimension in these models is set to 4, with a total of 16 blocks generated in the unit square; for the MP-RM, the budget parameter is set to 3, which suggests that approximately \((3+1)\times (3+1)\) blocks would be generated. We use the generative processes of the corresponding models to initialise their random variables, as independent and random initialisation of these random variables would make the MCMC algorithm take a longer sequence of iterations to converge.

5.3 Link prediction performance

The performance of each model in the task of link prediction is shown in Fig. 4, which reports both the average area under the curve of the receiver operating characteristic (AUC) and the precision (Precision). The AUC is the probability that the model will rank a randomly chosen positive link higher than a randomly chosen zero-valued link. The precision is the average ratio of correctly predicted positive links to the total number of predicted positive links. Higher values of AUC and precision indicate better model performance. For each dataset, we vary the ratio of training data from \(10\%\) to \(90\%\) and use the remainder for testing. The training/test data split is created by rows: we take the same ratio of training data from each row of the relational matrix, so that each node shares the same amount of training data. It is noted that the heterogeneity of the nodes' degrees might cause this choice to sample more 1-valued links from higher-degree nodes, and might thus produce a systematic bias in the results.
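A minimal sketch of this by-row split (assuming held-out entries are chosen uniformly within each row) is:

```python
import numpy as np

def row_split_mask(R, train_ratio, rng):
    """Boolean mask of training entries, taking the same ratio from each row."""
    n = R.shape[0]
    mask = np.zeros_like(R, dtype=bool)
    m = int(round(train_ratio * n))      # number of training entries per row
    for i in range(n):
        cols = rng.choice(n, size=m, replace=False)
        mask[i, cols] = True
    return mask
```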

From Fig. 4, both the AUC and precision of all models improve as the amount of training data increases. The trend generally plateaus once the training proportion exceeds 0.3, indicating the amount of data required to fit a model of roughly 16-block complexity.

Except for the Facebook data, we can see that the AUC and precision of both the ISG-SBM and the LFSG-SBM are better than for the piecewise-constant graphon models (i.e. the SBM and MP-RM) and the continuous graphon model (i.e. GP-RM). The proposed smoothing graphons typically achieve similar performance to the MMSB, demonstrating that the smoothing graphon strategy is useful for improving model performance. For the Facebook dataset, the SBM seems to perform better than the smoothing graphon-based models. This is examined in greater detail in the next section.

Figure 5 shows the relationship between AUC performance and the number of blocks for the LFSG-SBM and the MMSB. In general, the AUC values exhibit two stages of behaviour. Initially, the AUC increases as the number of blocks grows, likely due to the enlarged representational capacity of the model; the AUC then starts to decline even as the number of blocks continues to increase, likely due to overfitting. The AUC values of the LFSG-SBM are usually better than those of the MMSB when the number of blocks is large. That is, the LFSG-SBM is more robust to overfitting than the MMSB.

We additionally compare model performance with the nonparametric method of Zhang et al. (2017), which smooths over node neighbourhoods and thereby avoids an explicit block structure. We implement this model using the R package graphon, with a \(90\%/10\%\) training/testing split. The resulting link prediction performance is shown in Table 3. Compared with the results in Fig. 4, the LFSG and ISG models clearly achieve stronger performance.
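
For reference, the core of the neighbourhood smoothing estimator can be sketched as follows; this is an illustrative reimplementation in the spirit of Zhang et al. (2017), not the R graphon package itself, and the bandwidth choice follows the recommendation in that paper.

```python
import numpy as np

# Sketch of neighbourhood smoothing: estimate each link probability by
# averaging the adjacency rows of structurally similar nodes.
def neighborhood_smoothing(A, h=None):
    n = A.shape[0]
    if h is None:
        h = np.sqrt(np.log(n) / n)          # quantile bandwidth
    D = np.zeros((n, n))                    # squared row distances
    for i in range(n):
        for j in range(i + 1, n):
            # d(i, j)^2 = max_{k != i, j} |<A_i - A_j, A_k>| / n
            inner = (A[i] - A[j]) @ A.T / n
            inner[[i, j]] = 0.0             # exclude k in {i, j}
            D[i, j] = D[j, i] = np.max(np.abs(inner))
    P = np.zeros((n, n))
    for i in range(n):
        q = np.quantile(D[i], h)            # h-quantile of distances from node i
        neighbours = D[i] <= q
        P[i] = A[neighbours].mean(axis=0)   # average over similar nodes' rows
    return (P + P.T) / 2                    # symmetrise the estimate
```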

5.4 Graphon and hidden label visualisation

In addition to the quantitative analysis, we visualise the generated graphons and hidden labels under the LFSG-SBM on all six datasets in Figs. 6 and 7. Note that all these visualisations are based on a single draw from the posterior distribution: the figures are created by re-ordering the nodes according to their estimated latent coordinates, which depend on that particular draw. For each dataset, we visualise the resulting mixture intensities for one posterior sample, together with the learned posterior mean of the smoothing parameter \(\lambda\), using \(90\%\) training data. The displayed graphon intensities exhibit smooth transitions between blocks for each dataset, highlighting that continuous, rather than discrete, mixture intensity values are generated under the smoothing graphon. The transition speed of the intensity between blocks is governed by the smoothing parameter \(\lambda\): a larger value of \(\lambda\) leads to a less smooth graphon, and a smaller value of \(\lambda\) to a smoother graphon, similar to the behaviour observed in Fig. 2.
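
To give a self-contained feel for this mechanism, the toy one-dimensional sketch below interpolates between segment intensities with a softmax weighting. This is our own illustrative construction, not the exact ISG/LFSG mixture used in the paper, but it reproduces the qualitative role of \(\lambda\): large \(\lambda\) approaches a hard assignment, while small \(\lambda\) yields smooth transitions.

```python
import numpy as np

# Toy 1-D smoothing illustration (not the paper's exact construction):
# mix segment intensities with weights that decay with distance from
# each segment centre, at a rate controlled by lam.
def smoothed_intensity(u, boundaries, intensities, lam):
    centres = (boundaries[:-1] + boundaries[1:]) / 2
    logits = -lam * np.abs(u - centres)   # nearer segments get larger weight
    w = np.exp(logits - logits.max())
    w /= w.sum()
    # lam -> inf concentrates all weight on the nearest centre (hard assignment)
    return float(w @ intensities)

# Example: three segments on [0, 1]; lam = 50 is nearly piecewise-constant,
# lam = 2 transitions smoothly between neighbouring segments.
b = np.array([0.0, 0.3, 0.7, 1.0])
eta = np.array([0.9, 0.1, 0.6])
print(smoothed_intensity(0.45, b, eta, lam=50.0),
      smoothed_intensity(0.45, b, eta, lam=2.0))
```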

In Fig. 6, for each dataset, we also display the posterior proportions of the pairwise hidden labels \(s_{ij}\) (top right) and \(r_{ij}\) (bottom right) for each node. Here the x-axis indexes the nodes (sorted by their label probabilities) and the y-axis displays the posterior mean of the label probabilities (each label represented by a different colour). For a given node on the x-axis, observing more colours along the y-axis indicates a greater diversity of groups associated with that node, and hence a higher potential for that node to belong to different groups when interacting with other nodes. In other words, the further these plots deviate from sharp vertical transitions between groups, the more nodes belong to multiple groups.

The values of the smoothing parameter \(\lambda\) estimated on the Facebook and Twitter datasets are larger than those learned on the other four datasets. Further, the visualisations of the hidden labels for these two datasets are partitioned by almost straight horizontal lines, suggesting that only one label is a realistic possibility for most of the nodes. This may explain why both the AUC and precision values of the ISG-SBM and the LFSG-SBM are less competitive than those of the SBM on these two datasets (Fig. 4): the SBM assigns each node to exactly one group, which aligns well with the ground truth for these two datasets.

Another explanation for the performance on the Facebook and Twitter datasets is that the SBM is recovered if and only if \(\lambda = \infty\); for any finite value of the smoothing parameter \(\lambda\), the posterior can place no mass on the SBM itself. To address this, we might reparameterise \(\lambda\) through a map from \((0, \infty )\) to \((0, 1)\) (e.g. \(\lambda \mapsto 1-e^{-\lambda }\)), so that substantial posterior mass can be placed close to 1. Since this mapping permits model fits arbitrarily close to the SBM, we would then expect the ISG-SBM and LFSG-SBM models to perform similarly to or better than the SBM, even on the Facebook and Twitter datasets.
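
A sketch of this reparameterisation, with hypothetical function names:

```python
import numpy as np

# Map the smoothing parameter lambda from (0, inf) to (0, 1); values of u
# near 1 correspond to very large lambda, i.e. models arbitrarily close
# to the piecewise-constant SBM limit.
def to_unit_interval(lam):
    return 1.0 - np.exp(-lam)

def from_unit_interval(u):
    return -np.log1p(-u)   # inverse map: lambda = -log(1 - u)
```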

Figure 8 illustrates one sample partition drawn from the posterior distribution. As the black dots represent observed linkages, the pattern of their aggregation into dense blocks is clearly visible. Furthermore, the dense regions in the partitions of the Facebook and Twitter datasets show clearer rectangular shapes than those of the other datasets. This is consistent with Fig. 6, in which the smoothing parameter \(\lambda\) is larger for these two datasets. For the other datasets, the dense regions are not confined to regular shapes, and accordingly may be better modelled by our method.

6 Conclusion

In this paper, we have introduced a smoothing strategy that modifies conventional piecewise-constant graphons in order to increase their continuity. Through the introduction of a single smoothing parameter \(\lambda\), we first developed the Integrated Smoothing Graphon (ISG), which addresses the key limitation of existing piecewise-constant graphons, namely that they generate only a limited number of discrete intensity values. To improve the computational efficiency of the ISG and to allow each node to belong to multiple groups, we further developed the Latent Feature Smoothing Graphon (LFSG) by introducing auxiliary hidden labels. Our experimental results verify the effectiveness of this smoothing strategy in terms of greatly improved AUC and precision scores in the task of link prediction. The visualisations of the generated graphons and the posterior hidden label summaries further provide an intuitive understanding of the smoothing mechanism for a given dataset.