1 Introduction

Human-centered networked systems, such as cyber-physical systems (CPS) and the Internet of Things (IoT) [1–4], have accumulated large amounts of multimedia information, including text, images, and video. Such heterogeneous, inter-networked environments present their data as graph-structured networks. Traditional human-centered computing methods for such cyberspace data depend largely on the analysis of structural properties. However, they have difficulty detecting updated or hidden social relations, because plain graphs that carry only topology (nodes and edges) limit the scope of analysis for modern social and physical networks. For example, in structureless networks we may be unable to infer the social behavior of an unknown node or to predict the targets that a newly arrived node will communicate with. It is therefore crucial to incorporate useful multimedia resources to discover unknown social relations.

Generally, the problem of predicting unknown relationships in networks is called “link prediction” [5, 6]. Researchers mostly concentrate on inferring the link formation process from the connections observed in current networks. In recent years, network representation learning (NRL) [7, 8] has been proposed to support subsequent network processing and improve the performance of relation inference. It aims at learning an embedding space that preserves the relationships needed for network reconstruction and supports network inference effectively. Many NRL methods [9–12] have indeed advanced graph pattern discovery, analysis, and prediction. However, there exists a “blind zone” in network-structureless situations, where few explicit connections can be observed. We cannot expect the complex network environments of CPS or IoT [13–15] to always be well structured and to capture the whole picture of the networked system. For instance, some nodes may be isolated from the main graph, so that valuable information is blocked for lack of explicit structural properties; or several newly arrived nodes without any topology information may be waiting to be added to the main system or to connect with others. Thus, if a network is analyzed only from the perspective of the currently observed structural information, some potential social relations cannot be discovered, since hidden but vital relationships between nodes may not be preserved in the embedding space. The incomplete network structure forces us to neglect such implicit information. To the best of our knowledge, existing NRL methods fail to handle structureless networks because the core that drives them is the “currently observed connections.” Admittedly, this task seems intractable if the graph structure is the only feature that we can utilize.
Therefore, it is difficult to construct the missing parts of the original network or to infer the undiscovered relationships of nodes. Specifically, if the text data in Fig. 1 is omitted, we cannot identify the topic of the yellow node or the links it may have. It is also unclear how to construct potential relations such as friendships or partnerships among the nodes in the network-structureless set. However, modern networked systems have generated and collected a large amount of multimedia data, which can provide related clues for discovering unknown social relations. As illustrated in Fig. 1, if such textual information is considered, we can infer that the yellow node has a higher probability of being linked to the blue node labeled “network analysis” than to the orange one labeled “image processing,” and we can also establish meaningful connections for the black nodes based on their textual information.

Fig. 1 Illustration of using text data to construct social relations. Blue represents the topic of “network analysis,” while orange represents the topic of “image processing.” The gray nodes make up part of the graph-structured data. The black nodes are in the structureless node set. The solid lines denote the existing links, while the dotted lines denote the potential links

Some researchers [16–18] try to utilize textual information, but they concentrate on integrating and balancing graph-structured data and text data in network embedding to improve the performance of relation inference. Hence, they still cannot detect latent social relations in incomplete networks. When the structural information of the graph data to be analyzed is missing, we also treat text data as the accompanying resource. However, we instead attempt to bridge the gap between text data and graph-structured data, so that the “observed texts” can be encoded to substitute for the incomplete structural information. As a consequence, we can predict missing or future social relations from text-domain information. To this end, we apply deep domain adaptation (DDA) techniques to map between the two modalities. DDA techniques embed domain adaptation in deep learning frameworks to learn more transferable representations [19]. Our task requires generating simulated samples that are similar to the target samples while preserving the source-domain information, and the techniques used in existing structure-based NRL methods have difficulty making valid predictions of social relations from non-graph-structured data. We therefore use generative adversarial networks (GANs) [20] to address this problem. Inspired by image-to-image translation [21–23], we propose social relation GAN (SRGAN) for cross-domain knowledge translation. Two GANs are employed in our framework. One (namely t-GAN) aims to learn the content-to-structure mapping by adversarially training a discriminator and a generator.
Specifically, the discriminator tries to distinguish the real network embeddings from the fake embeddings transformed from the text domain, and the generator tries to fool the discriminator by making the fake embeddings look like those learned from the graph domain. The other (namely g-GAN) learns the structure-to-content mapping by inverting the task of t-GAN. In addition, as shown in [21–23], reconstructing the source or target samples can improve the performance of domain adaptation. Thus, we also apply reconstruction techniques in our adversarial training process. Social relations are an essential characteristic of social networks. The “sociality” in this task derives from two aspects: one is the original network, where the structural information is mostly preserved, and the other is the structureless network, where the text-domain information can be encoded to substitute for the incomplete structural information. SRGAN tries to make the transformed data reflect the sociability of the graph-structured data, that is, its tendency to form social connections. Experimental results on three real-world datasets show that translating meaningful social relations from text-domain information is challenging, and that SRGAN outperforms the baseline methods.

Overall, our main contributions lie in three aspects:

  1. Our approach is a remedy for most existing structure-based NRL techniques, which have difficulty handling such text-based, network-structureless problems;

  2. We bridge the gap between graph-structured data and text data using GANs in networked systems;

  3. Meaningful social relations can be translated from text-domain information by our proposed approach.

The rest of the paper is organized as follows. In Section 2, related work is briefly introduced. In Section 3, we present the approach of bridging the gap between graph-structured data and text data using GANs. After that, the proposed SRGAN is evaluated over several baselines and the detailed experiments are given in Section 4. Finally, we conclude our work and point out the future work in Section 5.

2 Related work

Human-centered techniques have achieved great success in many real-world applications [24–31]. As mentioned in [7], there are two goals for NRL. First, the learned embedding space can reconstruct the original network. The network relationships are reflected in the relative distance between any two nodes in the embedding space. If there is an edge between two nodes, then the distance between them should be relatively small. Second, the learned embedding space can support network inference, such as link prediction, node identification, and label inference.

Hence, many structure-based methods have been proposed for learning network embedding spaces from the network topology. Inspired by natural language processing, Perozzi et al. [9] treated nodes as words, while paths generated by a random walk model over a network were regarded as sentences and fed into the word2vec framework [32], with the aim of preserving the neighborhood structure. The methods in [10, 33] improved the network exploration strategy to capture more meaningful node sequences. To handle very large-scale information networks, LINE [11] was proposed to preserve local and global network structures: it uses local pairwise proximity to learn half of the dimensions over the neighbors of nodes, and constrains the sampled nodes to a two-hop distance from the sources to learn the rest. To effectively capture the highly non-linear network structure, SDNE [34] exploited first-order and second-order proximity jointly to preserve the global and local structures. Similar to image-based convolutional networks, Niepert et al. [35] proposed a framework for learning convolutional neural networks for arbitrary graphs. The graph convolutional network [36] used a localized first-order approximation of spectral graph convolutions for semi-supervised learning on graph-structured data. To enhance the robustness of network embeddings, Dai et al. [37] proposed an adversarial network embedding (ANE) framework, which utilized GANs to capture latent graph features. Gao et al. [38] generated proximities via a GAN framework to discover the relationships between nodes. HeGAN [39] was developed to capture the rich semantics of heterogeneous information networks. GraphRNA [40] was composed of a collaborative walking mechanism and a tailored deep embedding architecture, where joint random walks on attributed networks were utilized to boost the learning of node representations.

In addition to structure-based networks, some networks are accompanied by rich external text-based information, such as content attributes or text profiles. Tu et al. [41] embedded nodes and edges into the same vector space based on the accompanying semantic information. Yang et al. [17] incorporated textual information of the nodes into NRL under a matrix factorization framework. CENE [18] jointly leveraged the network structure and the content information to enhance the network representation. MMDW [42] tried to learn discriminative network representations by utilizing the labeling information of the nodes. CANE [16] was proposed for modeling the relationships between nodes given rich external information. He et al. [43] fused both structural and content information in a generative manner.

Textual information is useful in learning network embedding spaces because it can provide related clues for constructing relationships among network nodes. However, most existing text-enhanced NRL techniques concentrate on integrating and balancing graph-structured data and text data in network embedding, which means they still need structural information about the unseen nodes. In some real-world scenarios (as discussed in Section 1), we might not know such information beforehand. Therefore, we apply DDA methods to learn transferable representations for mapping between the two modalities. Since the task requires the transformed data to be similar to the target data while preserving the source-domain information, we introduce GANs to address this problem, similar to the adversarial-based and reconstruction-based DDA approaches [21–23, 44, 45]. The main difference between our proposed approach and previous NRL work is that we focus on the “blind zone,” trying to infer meaningful social relations in structureless network environments from the accompanying text resources. Our approach can be regarded as a remedy that lets most existing NRL techniques analyze text-based networked systems.

3 SRGAN

3.1 Generative adversarial networks

To capture the “sociality,” GANs [20] are employed in our framework. GANs are needed because the adversarial process can break the barriers between multi-modal data, so that potential social connections in structureless situations can be inferred from non-graph-structured data. The basic idea behind GANs is to set up a game between a generator and a discriminator [46]. The generator tries to mimic the distribution of the training data, i.e., it generates fake samples that are intended to be as indistinguishable from the real ones as possible, while the discriminator determines whether a sample is fake or real using supervised learning techniques.

3.2 Text-based network definitions

Let \(\mathcal {G}=(V,E)\) be a graph, where \(V\) denotes the set of vertices (nodes) and \(E \subset (V \times V)\) denotes the set of edges. \(V' \subset V\) denotes the set of vertices lacking structural information, while \(V''=V \setminus V'\) denotes the rest, whose structures are known. Let \(T\) be the set of textual information of \(V\). For the purpose of translating social relations from text-domain information, we assume without loss of generality that there exists an injective mapping \(f: T \to V\) between the two types of data, i.e., \(\mathcal {S}=\{(t,v)|v=f(t),t \in T,v \in V\}\).

We also present two important concepts as follows.

Real embeddings are the mathematical embeddings in a continuous vector space learned by domain-specific representation learning techniques.

Fake embeddings are created by the generators trying to make discriminators incapable of separating them from the real embeddings.

3.3 Cross-domain knowledge translation

Our task falls under the heterogeneous domain adaptation setting [19]. It is defined as transforming text-modal data into graph-modal data using cross-modal mapping knowledge learned from the two domains. Suppose \(\mathcal {X}_{V} \in \mathbb {R}^{|V| \times d_{V}}\) denotes the embedding space of graph-modal data and \(\mathcal {X}_{T} \in \mathbb {R}^{|T| \times d_{T}}\) denotes the embedding space of text-modal data, with \(\mathcal {X}_{V} \ne \mathcal {X}_{T}\). Here \(d_{V}\) and \(d_{T}\) are the small numbers of latent dimensions. The purpose is to learn the cross-modal mapping \(g(\cdot)\) with the following characteristics:

Indistinguishability. For \(\mathbf {x}_{t} \in \mathcal {X}_{T}, g(\mathbf {x}_{t}) \in \mathcal {X}_{V}\). The transformed data \(g(\mathbf {x}_{t})\) should match the form of graph-modal data in the embedding space \(\mathcal {X}_{V}\). Meanwhile, it should be hard for a domain discriminator to distinguish the transformed data from the original data.

Structure awareness. For \((t, v) \in \mathcal {S}, g(\mathbf {x}_{t})\) should play the role of \(\mathbf {x}_{v}\) in the embedding space \(\mathcal {X}_{V}\) to some extent. In other words, the transformed data \(g(\mathbf {x}_{t})\) should be able to hold the structural relationships of \(\mathbf {x}_{v}\) with network homophily [47]. Hence, the empirical structure-preserving objective is as follows.

$$ \begin{aligned} \min\limits_{g} &\sum_{u \in V}\lvert P(g(\mathbf{x}_{t})|\mathbf{x}_{u})-P(\mathbf{x}_{v}|\mathbf{x}_{u})\rvert\\ &+\sum_{u \in V}\lvert P(\mathbf{x}_{u}|g(\mathbf{x}_{t}))-P(\mathbf{x}_{u}|\mathbf{x}_{v})\rvert, \end{aligned} $$
(1)

where \(\lvert \cdot \rvert\) denotes the absolute value and \(P(\cdot|\cdot)\) denotes the conditional probability defined in [11, 16]. For example, suppose \(P(\mathbf {x}_{v}|\mathbf {x}_{u})\) is the conditional probability of node \(v\) generated by node \(u\):

$$ P(\mathbf{x}_{v}|\mathbf{x}_{u})=\frac{\exp\left(\mathbf{x}_{v}^{\mathsf{T}} \cdot \mathbf{x}_{u}\right)}{\sum_{z \in V}{\exp\left(\mathbf{x}_{z}^{\mathsf{T}} \cdot \mathbf{x}_{u}\right)}}, $$
(2)

where \(\exp(\cdot)\) stands for the exponential function and \(\mathbf {x}_{v}^{\mathsf {T}}\) is the transpose of \(\mathbf {x}_{v}\).
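As an illustration of Eq. (2), the following minimal NumPy sketch computes \(P(\mathbf {x}_{v}|\mathbf {x}_{u})\) for every node \(v\) given a fixed node \(u\). The function name `conditional_prob`, the toy embedding matrix, and the max-shift for numerical stability are assumptions made for this example; they are not taken from the original implementation.

```python
import numpy as np

def conditional_prob(X, u):
    """Return the vector of P(x_v | x_u) over all nodes v, as in Eq. (2)."""
    logits = X @ X[u]                  # x_z^T x_u for every z in V
    logits = logits - logits.max()     # stability shift; the softmax value is unchanged
    weights = np.exp(logits)
    return weights / weights.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))            # toy embeddings: 5 nodes, 4 dimensions (illustrative)
p = conditional_prob(X, u=0)
```

Each entry of `p` is positive and the entries sum to one, since the denominator of Eq. (2) normalizes over all nodes in \(V\).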

3.4 Pretrained domain embedding space

As the graph-modal data and text-modal data are totally heterogeneous in their original forms, there are two demands for representing each one of them:

  1. The domain-specific embedding space should preserve meaningful relationships or semantics of the modality;

  2. The form of the modality representations should be easy to translate from one to the other.

Hence, before learning the cross-modal mapping knowledge, we first apply the skip-gram framework [32, 48], one of the most popular techniques in deep learning [49, 50], to pretrain the embedding space of each domain. Skip-gram is one of the frameworks in word2vec; it represents each word as a vector in a continuous low-dimensional space, where similar words are close to each other. The objective under skip-gram is to maximize the conditional probability of each word given its context.

Graph domain. In the graph domain, we adopt the random walk strategy to generate the node sequences. As in [9, 10, 33], we then apply the skip-gram framework with negative sampling to map nodes to a continuous vector space. We maximize the likelihood of node sequences to learn the structural regularities in the networks.

Text domain. In this domain, we also utilize skip-gram to train word vectors. We then take the average of the word vectors as the text embedding, which has proved effective in text representation tasks [18, 51].
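The node-sequence generation step can be sketched as follows. This is a uniform random walk over an adjacency list, in the spirit of DeepWalk [9]; the function `random_walks`, the toy graph, and the handling of isolated nodes (the walk simply stops) are assumptions for illustration. The resulting sequences would then be passed to a skip-gram trainer, which is not shown here.

```python
import random

def random_walks(adj, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks; adj maps each node to a list of neighbors."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adj)
        rng.shuffle(nodes)                      # visit start nodes in random order
        for start in nodes:
            walk = [start]
            # extend until the target length is reached or a dead end is hit
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

# Toy graph (hypothetical): a triangle plus one isolated node.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: []}
walks = random_walks(adj, walk_length=5, walks_per_node=2)
```

Every consecutive pair in a walk is an existing edge, so the skip-gram context windows over these sequences reflect neighborhood structure.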
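The averaging step for the text domain can be sketched as below. The dictionary `word_vectors` stands in for a pretrained skip-gram vocabulary, and returning a zero vector when no token is known is an assumption of this example rather than a detail given in the text.

```python
import numpy as np

def text_embedding(tokens, word_vectors, dim):
    """Average the vectors of known tokens; fall back to zeros if none is known."""
    known = [word_vectors[w] for w in tokens if w in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 2-dimensional word vectors, for illustration only.
word_vectors = {
    "network": np.array([1.0, 0.0]),
    "analysis": np.array([0.0, 1.0]),
}
emb = text_embedding(["network", "analysis", "unseen"], word_vectors, dim=2)
```

Here the out-of-vocabulary token `"unseen"` is skipped, so `emb` is the mean of the two known vectors.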

3.5 Cross-modal framework

The overall framework of SRGAN is illustrated in Fig. 2. In this particular DDA task, we first use domain-specific encoders (\(E_{T}\) and \(E_{V}\)) to pretrain text embeddings and network embeddings, respectively. Two GANs are then employed to deal with the data in set \(\mathcal {S}\), where \(\mathbf {x}_{v} \sim P_{V}(\mathbf {x}_{v})\) and \(\mathbf {x}_{t} \sim P_{T}(\mathbf {x}_{t})\) are the two modal data distributions. More precisely, \(G_{T}:\mathcal {X}_{T} \mapsto \mathcal {X}_{V}\) encodes text embeddings into fake network embeddings, while \(D_{V}\) tries to distinguish them from the real ones in the graph domain. Conversely, \(G_{V}:\mathcal {X}_{V} \mapsto \mathcal {X}_{T}\) and \(D_{T}\) invert the process to regulate t-GAN and prevent mode collapse. During adversarial training, reconstruction and construction losses are produced to update the parameters of the GANs, so that the transformed data is indistinguishable in the target domain and also capable of holding structural relationships with network homophily.

Fig. 2 Illustration of the SRGAN framework. \(E_{T}\) and \(E_{V}\) are the domain encoders used to pretrain the embedding spaces. \(s_{T}\) and \(s_{V}\) are the scores that evaluate the likelihood of the transformed data being the target one. The light blue and light green embeddings are the reconstructed data, while the dark ones are the transformed data. The blue arrows denote the text-node-text data flows, while the green ones denote the node-text-node data flows

Content-to-structure. In this framework, t-GAN learns content-to-structure mapping knowledge, the process of which transforms text-modal data to graph-modal data. It aims at reflecting the structural style of a node by its textual information.

Structure-to-content. Unlike image-to-image translation, we interpret the intention of g-GAN as learning the structure-to-content mapping knowledge, which provides backward cycle consistency [21] to regulate t-GAN and prevent mode collapse. The necessity of this has been thoroughly discussed in [22].

The least-squares adversarial loss [52] is applied for both GANs to match the distribution of the generated modal data to the data distribution in the target domain. The objective functions are as follows.

$$ \mathcal{L}_{G_{T}}=\frac{1}{2}\mathbb{E}_{\mathbf{x}_{t} \sim P_{T}(\mathbf{x}_{t})}\left[(D_{V}(G_{T}(\mathbf{x}_{t}))-1)^{2}\right], $$
(3)
$$ \mathcal{L}_{G_{V}}=\frac{1}{2}\mathbb{E}_{\mathbf{x}_{v} \sim P_{V}(\mathbf{x}_{v})}\left[(D_{T}(G_{V}(\mathbf{x}_{v}))-1)^{2}\right], $$
(4)
$$ \begin{aligned} \mathcal{L}_{D_{T}}=&\frac{1}{2}\mathbb{E}_{\mathbf{x}_{t} \sim P_{T}(\mathbf{x}_{t})}\left[(D_{T}(\mathbf{x}_{t})-1)^{2}\right]\\ &+\frac{1}{2}\mathbb{E}_{\mathbf{x}_{v} \sim P_{V}(\mathbf{x}_{v})}\left[(D_{T}(G_{V}(\mathbf{x}_{v})))^{2}\right], \end{aligned} $$
(5)
$$ \begin{aligned} \mathcal{L}_{D_{V}}=&\frac{1}{2}\mathbb{E}_{\mathbf{x}_{v} \sim P_{V}(\mathbf{x}_{v})}\left[(D_{V}(\mathbf{x}_{v})-1)^{2}\right]\\ &+\frac{1}{2}\mathbb{E}_{\mathbf{x}_{t} \sim P_{T}(\mathbf{x}_{t})}\left[(D_{V}(G_{T}(\mathbf{x}_{t})))^{2}\right], \end{aligned} $$
(6)

where “1” denotes the real label, and “0” (omitted in Eqs. (5) and (6)) denotes the fake label. We minimize \(\mathcal {L}_{G_{T}}\) and \(\mathcal {L}_{D_{V}}\) to make the generated modal data indistinguishable in the graph-domain embedding space. Meanwhile, we also minimize \(\mathcal {L}_{G_{V}}\) and \(\mathcal {L}_{D_{T}}\) to enable \(G_{V}(\mathbf {x}_{v}) \in \mathcal {X}_{T}\).
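The least-squares losses in Eqs. (3) to (6) reduce to simple mean-squared terms once the discriminator scores are available. The sketch below takes the scores as plain arrays; the function names and the toy score values are assumptions for illustration, and the expectations are approximated by batch means.

```python
import numpy as np

def generator_loss(d_fake):
    """Eqs. (3)/(4): push discriminator scores of generated samples toward the real label 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

def discriminator_loss(d_real, d_fake):
    """Eqs. (5)/(6): push real samples toward 1 and generated samples toward 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

# Toy discriminator scores (illustrative values only).
d_real = np.array([0.9, 1.1])     # e.g. D_V(x_v) on real network embeddings
d_fake = np.array([0.2, -0.2])    # e.g. D_V(G_T(x_t)) on transformed text embeddings
g_loss = generator_loss(d_fake)
d_loss = discriminator_loss(d_real, d_fake)
```

In this toy batch the discriminator separates the two groups well, so `d_loss` is small while `g_loss` is comparatively large; the same two functions serve both t-GAN and g-GAN by swapping the roles of the domains.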

The purpose of cross-modal mapping is not only to make the transformed modal data indistinguishable in the target domain but also to make it hold the network structure of the target node. Therefore, besides the adversarial losses, we also apply reconstruction losses and construction losses to optimize both GANs. We adopt the cycle consistency loss (reconstruction loss) [21–23] to compute the reconstruction error; the idea is to use transitivity to induce the generators to be consistent with each other. The reconstruction loss measures how well the original data is reconstructed after passing through both generators in sequence. Meanwhile, the construction loss is also needed to measure the similarity of relationships between the transformed data and the original one.

$$ \begin{aligned} &\mathbf{x}_{v}^{*}=G_{T}(\mathbf{x}_{t}),\\ &\mathbf{x}_{t}^{*}=G_{V}(\mathbf{x}_{v}), \end{aligned} $$
(7)
$$ \begin{aligned} &\mathbf{x}_{t}^{*'}=G_{V}(\mathbf{x}_{v}^{*}),\\ &\mathbf{x}_{v}^{*'}=G_{T}(\mathbf{x}_{t}^{*}), \end{aligned} $$
(8)

where \(\mathbf {x}_{v}^{*}\) and \(\mathbf {x}_{t}^{*}\) in Eq. (7) are the transformed data, and \(\mathbf {x}_{t}^{*'}\) and \(\mathbf {x}_{v}^{*'}\) in Eq. (8) are the reconstructed data. Thus, the cycle consistency loss is defined as follows.

$$ \mathcal{L}_{cyc}=\mathbb{E}_{\mathbf{x}_{t} \sim P_{T}(\mathbf{x}_{t})}\left[\|\mathbf{x}_{t}^{*'}-\mathbf{x}_{t}\|_{1}\right] +\mathbb{E}_{\mathbf{x}_{v} \sim P_{V}(\mathbf{x}_{v})}\left[\|\mathbf{x}_{v}^{*'}-\mathbf{x}_{v}\|_{1}\right], $$
(9)

where the \(l_{1}\) norm is applied in the loss, and we push \(G_{V}(G_{T}(\mathbf {x}_{t})) \approx \mathbf {x}_{t}\) and \(G_{T}(G_{V}(\mathbf {x}_{v})) \approx \mathbf {x}_{v}\) in the cycle by minimizing \(\mathcal {L}_{cyc}\).
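A minimal sketch of the cycle consistency loss in Eq. (9) is given below. The generators are replaced by simple scaling functions so that the behavior is easy to verify by hand; these stand-ins, the helper `l1_mean`, and the toy batches are assumptions for this example and do not reflect the trained networks.

```python
import numpy as np

def l1_mean(a, b):
    """Batch mean of the row-wise l1 distance."""
    return float(np.abs(a - b).sum(axis=1).mean())

def cycle_loss(x_t, x_v, G_T, G_V):
    x_t_rec = G_V(G_T(x_t))     # text -> node -> text
    x_v_rec = G_T(G_V(x_v))     # node -> text -> node
    return l1_mean(x_t_rec, x_t) + l1_mean(x_v_rec, x_v)

# Stand-in generators (hypothetical): G_V exactly inverts G_T, G_V_bad does not.
def G_T(x):
    return 2.0 * x

def G_V(x):
    return 0.5 * x

def G_V_bad(x):
    return x

x_t = np.array([[1.0, -1.0], [0.5, 0.0]])
x_v = np.array([[2.0, 0.0], [0.0, -3.0]])
perfect = cycle_loss(x_t, x_v, G_T, G_V)
imperfect = cycle_loss(x_t, x_v, G_T, G_V_bad)
```

When the two generators invert each other the loss vanishes; any mismatch between them is penalized in proportion to the \(l_{1}\) reconstruction error.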

The construction loss is defined in Eq. (10):

$$ \mathcal{L}_{con}=\mathbb{E}_{\mathbf{x}_{t} \sim P_{T}(\mathbf{x}_{t})}\left[\|\mathbf{x}_{t}^{*}-\mathbf{x}_{t}\|_{1}\right] +\mathbb{E}_{\mathbf{x}_{v} \sim P_{V}(\mathbf{x}_{v})}\left[\|\mathbf{x}_{v}^{*}-\mathbf{x}_{v}\|_{1}\right]. $$
(10)

To satisfy Eq. (1), we minimize \(\mathcal {L}_{con}\). Let

$$ \begin{aligned} \mathcal{L}_{u \to v} &=\lvert P(G_{T}(\mathbf{x}_{t})|\mathbf{x}_{u})-P(\mathbf{x}_{v}|\mathbf{x}_{u})\rvert \\ &\propto \lvert \exp\left(G_{T}^{\mathsf{T}}(\mathbf{x}_{t}) \cdot \mathbf{x}_{u}\right)-\exp\left(\mathbf{x}_{v}^{\mathsf{T}} \cdot \mathbf{x}_{u}\right)\rvert\\ &=\exp\left(\mathbf{x}_{v}^{\mathsf{T}} \cdot \mathbf{x}_{u}\right) \cdot \lvert \exp\left(\left(G_{T}^{\mathsf{T}}(\mathbf{x}_{t}) - \mathbf{x}_{v}^{\mathsf{T}}\right) \cdot \mathbf{x}_{u}\right)-1\rvert \\ &\propto \lvert \exp\left(\left(G_{T}^{\mathsf{T}}(\mathbf{x}_{t}) - \mathbf{x}_{v}^{\mathsf{T}}\right) \cdot \mathbf{x}_{u}\right)-1\rvert, \end{aligned} $$

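if \(\|\mathbf {x}_{v}^{*}-\mathbf {x}_{v}\|_{1} \to 0\), then \(\mathcal {L}_{u \to v} \to 0\).

Besides, minimizing \(\lvert P(\mathbf {x}_{u}|G_{T}(\mathbf {x}_{t}))-P(\mathbf {x}_{u}|\mathbf {x}_{v})\rvert\) is equal to minimizing \(\lvert P^{-1}(\mathbf {x}_{u}|G_{T}(\mathbf {x}_{t}))-P^{-1}(\mathbf {x}_{u}|\mathbf {x}_{v})\rvert\). Thus, let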

$$ \begin{aligned} \mathcal{L}_{v \to u} &=\lvert P^{-1}(\mathbf{x}_{u}|G_{T}(\mathbf{x}_{t}))-P^{-1}(\mathbf{x}_{u}|\mathbf{x}_{v})\rvert\\ &=\lvert \sum_{z \in V}\exp\left(\left(\mathbf{x}_{z}^{\mathsf{T}}-\mathbf{x}_{u}^{\mathsf{T}}\right) \cdot G_{T}(\mathbf{x}_{t})\right)\\ &\quad\quad -\sum_{z \in V}\exp\left(\left(\mathbf{x}_{z}^{\mathsf{T}}-\mathbf{x}_{u}^{\mathsf{T}}\right) \cdot \mathbf{x}_{v}\right)\rvert\\ &=\lvert \sum_{z \in V} \exp\left(\left(\mathbf{x}_{z}^{\mathsf{T}}-\mathbf{x}_{u}^{\mathsf{T}}\right) \cdot \mathbf{x}_{v}\right) \\ &\quad\quad \cdot \left[\exp\left(\left(\mathbf{x}_{z}^{\mathsf{T}}-\mathbf{x}_{u}^{\mathsf{T}}\right) \cdot (G_{T}(\mathbf{x}_{t})-\mathbf{x}_{v})\right)-1\right]\rvert, \end{aligned} $$

if \(\|\mathbf {x}_{v}^{*}-\mathbf {x}_{v}\|_{1} \to 0\), then \(\mathcal {L}_{v \to u} \to 0\).
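The limiting argument for \(\mathcal {L}_{u \to v}\) can be checked numerically. The sketch below evaluates the unnormalized gap \(\lvert \exp (\mathbf {x}_{v}^{*\mathsf {T}} \cdot \mathbf {x}_{u})-\exp (\mathbf {x}_{v}^{\mathsf {T}} \cdot \mathbf {x}_{u})\rvert\) as the transformed embedding \(\mathbf {x}_{v}^{*}\) approaches \(\mathbf {x}_{v}\), and confirms the factorization used in the derivation. The random vectors and the perturbation schedule are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x_u = rng.normal(size=4)
x_v = rng.normal(size=4)
delta = rng.normal(size=4)          # direction of the residual x_v^* - x_v

def gap(eps):
    """Unnormalized |exp(x_v^{*T} x_u) - exp(x_v^T x_u)| with x_v^* = x_v + eps * delta."""
    x_star = x_v + eps * delta
    return abs(np.exp(x_star @ x_u) - np.exp(x_v @ x_u))

def factored(eps):
    """Same quantity written as exp(x_v^T x_u) * |exp((x_v^* - x_v)^T x_u) - 1|."""
    return np.exp(x_v @ x_u) * abs(np.exp(eps * delta @ x_u) - 1.0)

scales = (1.0, 0.1, 0.01, 0.0)
gaps = [gap(eps) for eps in scales]
```

The gap shrinks monotonically as the residual scale decreases and is exactly zero when \(\mathbf {x}_{v}^{*}=\mathbf {x}_{v}\), which is the behavior that minimizing \(\mathcal {L}_{con}\) aims for.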

The full objective of SRGAN is:

$$ \mathcal{L}=\mathcal{L}_{G_{T}}+\mathcal{L}_{G_{V}}+\mathcal{L}_{D_{T}}+\mathcal{L}_{D_{V}} +\alpha\mathcal{L}_{cyc}+\beta\mathcal{L}_{con}, $$
(11)

where the hyper-parameters α and β are the factors controlling the contributions of the reconstruction and construction losses, respectively.

3.6 Edge representation learning

The Hadamard product [53] is adopted for learning the edge representations from content-to-structure knowledge; it was shown experimentally to be effective in [10]. For example, given two nodes \(v, u\) with textual information, the edge representation is defined as \(\mathbf {x}_{e(v,u)}=\mathbf {Hadamard}(\mathbf {x}_{v}^{*},\mathbf {x}_{u}^{*})\), where \(\mathbf {x}_{e(v,u)}\) denotes the linking relationship of \(v, u\) translated from the corresponding texts.
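In code, the Hadamard edge representation is an element-wise product of the two transformed node embeddings. The function name and the toy vectors below are assumptions for illustration.

```python
import numpy as np

def edge_representation(x_v_star, x_u_star):
    """Element-wise (Hadamard) product of two transformed node embeddings."""
    return x_v_star * x_u_star

x_v_star = np.array([1.0, -2.0, 0.5])
x_u_star = np.array([3.0, 0.5, 4.0])
x_e = edge_representation(x_v_star, x_u_star)
```

The operator is symmetric, so the representation of the pair \((v,u)\) equals that of \((u,v)\).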

4 Experiments

In this section, several experiments are conducted to validate the performance of SRGAN and the baseline methods.

4.1 Datasets

The following three real-world network datasets are used in the experiments, and the detailed statistics of these datasets are listed in Table 1.

Table 1 Statistics of the datasets

CitNet. This dataset was extracted by Tang et al. [54], where each paper is regarded as a node and every directed edge between two nodes denotes a citation. We obtain 132,033 papers with abstract contents after filtering and split them into training and testing sets at a ratio of 70:30.

Cora. A typical benchmark [55] for text-based social network analysis. All of the articles are divided into 10 root categories. Again, 70% of the articles are chosen as the training data, and the rest is used for testing.

HepTh. A filtered graph dataset [16, 56] extracted from the e-print arXiv, where directed edges denote the citation relationships. Following the same rule, nodes with textual information are split into training and testing sets as well.

4.2 Baselines

The DDA methods (standard GAN and LSGAN), structure-based NRL methods (DeepWalk, node2vec, and AIDW), and text similarity methods (Jaccard and CosSim) are employed to demonstrate the effectiveness of SRGAN. Subsection 4.4 presents the transformation quality of each DDA method in neighborhood preserving. To evaluate the performance of translating social relations from text-domain information, in subsection 4.5 we also apply the Hadamard product for all DDA methods and structure-based NRL methods. Text similarity methods measure the similarity of texts directly. Detailed descriptions of all baselines are as follows.

GAN. A classic GAN [20] (standard GAN) learns two competing mappings: a discriminator and a generator, both of which are modeled as deep neural networks. They play a min-max game where the discriminator tries to identify the fake network embeddings and the generator tries to produce the examples as real as possible.

LSGAN. Least squares GAN [52] is able to generate samples that are closer to real data. It adopts the least squares loss function for the discriminator to move the fake samples toward the decision boundary, which also makes the learning process more stable.

DeepWalk. This online learning approach [9] learns low-dimensional latent representations of nodes from samples yielded by short random walks, and it scales to building incremental results. Note that, for nodes without structural information, we establish self-links to learn the network representations, because intuitively a node should have the closest relationship with itself. We switch from hierarchical softmax to negative sampling to improve efficiency [10].

node2vec. It simulates breadth-first sampling and depth-first sampling through tunable parameters \(p\) and \(q\). According to the parameter sensitivity experiments [10], we set \(p=q=0.5\) to balance outward exploration against staying at a proper distance from the start vertex. As with DeepWalk, we establish self-links for nodes without structural information in both the training and testing stages.

AIDW. ANE with inductive DeepWalk (AIDW) [37] alternates between a structure-preserving component and an adversarial-learning component to train the generator. The former component aims at encoding structural properties, while the latter acts as a regularizer for learning more stable and robust representations based on the adversarial learning principle.

Jaccard. Jaccard similarity coefficient [57], also known as Intersection over Union, is used for measuring text similarity between finite sentence sample sets. It is defined as the size of the word intersection divided by the size of the word union of the sample sets.
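A minimal sketch of the Jaccard baseline on word sets follows. Lowercasing, whitespace tokenization, and returning 0 for two empty texts are assumptions of this example; the text does not specify the preprocessing used in the experiments.

```python
def jaccard(text_a, text_b):
    """Size of the word intersection divided by the size of the word union."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 0.0                  # convention chosen here for two empty texts
    return len(a & b) / len(a | b)

score = jaccard("graph network analysis", "social network analysis")
```

The two toy texts share two of four distinct words, so `score` is 0.5.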

CosSim. Cosine Similarity measures the cosine of the angle between two text embeddings. It was adopted to build the document network dataset by Wang et al. [34]. Same as SRGAN, each text is also represented by the average of the word vectors.
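The CosSim baseline can be sketched as below, operating on text embeddings such as averaged word vectors. The zero-vector guard and the toy inputs are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings; 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(a @ b / denom)

sim = cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
```

For these two vectors the angle is 45 degrees, so `sim` is approximately 0.7071.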

4.3 Experiment settings

The embedding size for both modalities is set to 300, which is the same size as the Google pretrained word vectors. In the random-walk procedure, we empirically set the length of a random path to 30 and the number of iterations for each node to 20. Following the settings in [37], AIDW initializes the input data with the node embeddings pretrained by DeepWalk and applies one layer for the generator and a 512-512-1 layer structure for the discriminator. Because of the pretrained data forms, all DDA methods apply deep neural networks for linear transformation of the input data, and Adam [58] is employed as the optimization algorithm for the neural networks. Instead of using ReLU and Leaky ReLU [59–61] as in image-to-image translation, we adopt the hyperbolic tangent (tanh) as the activation function, and so do the other DDA methods. For regularization, we employ dropout (rate = 0.3) [62] for the generators of the GANs. Deep neural networks composed of an input layer, several hidden layers, and an output layer are employed for both generators (300-600-300-300) and discriminators (300-600-300-300-1). The weights of the neural networks are initialized from a Gaussian distribution with mean 0 and standard deviation 0.02. The best-performing hyper-parameters α and β in the space {0.1, 0.3, 0.5, 0.7, 0.9} are selected by grid search.
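The layer structures above can be sketched as a plain NumPy forward pass. The layer sizes, tanh activation, dropout rate, and Gaussian initialization follow the stated settings; the zero-initialized biases, the linear (activation-free) output layer, the inverted-dropout scaling, and the helper names `init_mlp` and `forward` are assumptions made for this example.

```python
import numpy as np

def init_mlp(sizes, rng, std=0.02):
    """Weights ~ N(0, std^2); biases start at zero (an assumption)."""
    return [(rng.normal(0.0, std, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x, rng=None, dropout=0.0):
    """tanh on hidden layers, linear output; optional inverted dropout on hidden units."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.tanh(h)
            if rng is not None and dropout > 0.0:
                mask = rng.random(h.shape) >= dropout
                h = h * mask / (1.0 - dropout)
    return h

rng = np.random.default_rng(0)
G_T = init_mlp([300, 600, 300, 300], rng)        # generator: text -> graph domain
D_V = init_mlp([300, 600, 300, 300, 1], rng)     # discriminator in the graph domain
x_t = rng.normal(size=(8, 300))                  # toy batch of text embeddings
fake = forward(G_T, x_t, rng=rng, dropout=0.3)   # dropout only on the generator
score = forward(D_V, fake)
```

The generator keeps the 300-dimensional embedding size, so its output can be fed directly to the discriminator, which returns one score per sample.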

4.4 Neighborhood preserving quality

In this subsection, we evaluate the transformation performance of each DDA method. \(Q_{l_{1}}\) and \(Q_{l_{2}}\) are the metrics used to validate the quality of the transformed data.

$$ \begin{aligned} &Q_{l_{1}}=\mathbb{E}_{v \sim V^{\prime}}[\mathbb{E}_{o \sim N(v)}\lvert\|\mathbf{x}_{v}^{*}-\mathbf{x}_{o}\|_{1}-\|\mathbf{x}_{v}-\mathbf{x}_{o}\|_{1}\rvert],\\ &Q_{l_{2}}=\mathbb{E}_{v \sim V^{\prime}}[\mathbb{E}_{o \sim N(v)}\lvert\|\mathbf{x}_{v}^{*}-\mathbf{x}_{o}\|_{2}-\|\mathbf{x}_{v}-\mathbf{x}_{o}\|_{2}\rvert], \end{aligned} $$
(12)

where \(N(v)\) denotes the neighbors of node \(v\). \(Q_{l_{1}}\) applies the \(l_{1}\) norm and \(Q_{l_{2}}\) applies the \(l_{2}\) norm.
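The metrics in Eq. (12) can be estimated as in the sketch below, which replaces both expectations with uniform empirical means (an assumption about how the expectations are estimated). The dictionary-based data layout and the function names are illustrative and not taken from the original implementation.

```python
import math


def lp_distance(a, b, p):
    """l1 (p=1) or l2 (p=2) distance between two vectors."""
    if p == 1:
        return sum(abs(x - y) for x, y in zip(a, b))
    if p == 2:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    raise ValueError("p must be 1 or 2")


def neighborhood_quality(original, transformed, neighbors, p):
    """Empirical Q_{l_p}.

    original:    node -> graph-domain embedding x_v
    transformed: test node -> transformed embedding x*_v
    neighbors:   test node -> list of neighbor nodes N(v)
    """
    per_node = []
    for v, x_star in transformed.items():
        nbrs = neighbors.get(v, [])
        if not nbrs:
            continue  # nodes without neighbors do not contribute
        gaps = [
            abs(lp_distance(x_star, original[o], p) - lp_distance(original[v], original[o], p))
            for o in nbrs
        ]
        per_node.append(sum(gaps) / len(gaps))
    if not per_node:
        raise ValueError("no test node has neighbors")
    return sum(per_node) / len(per_node)


# Toy example with one test node "v" and one neighbor "o".
original = {"v": [0.0, 0.0], "o": [3.0, 4.0]}
transformed = {"v": [0.0, 1.0]}
neighbors = {"v": ["o"]}
q_l1 = neighbor_q1 = neighborhood_quality(original, transformed, neighbors, p=1)
q_l2 = neighborhood_quality(original, transformed, neighbors, p=2)
```

A perfect transformation (\(\mathbf{x}_{v}^{*}=\mathbf{x}_{v}\)) yields 0 for both metrics, which is why smaller values indicate better neighborhood preservation.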

As discussed in Section 3.3, for \(v \in V^{\prime}\), we expect that the transformed data \(\mathbf {x}_{v}^{*}\) could hold the structural relationships of \(\mathbf{x}_{v}\) with network homophily in the graph-domain space. Thus, the idea of the proposed metrics in Eq. (12) is to measure how well the distances to the neighbors are preserved. If \(Q_{l_{1}}\) and \(Q_{l_{2}}\) are small, then the distances from the transformed data to its neighbors are close to those of the original data, i.e., the neighborhood relationships are well preserved. Figures 3 and 4 show the results of the transformation quality in neighborhood preserving. SRGAN outperforms the standard GAN and LSGAN on all three datasets. Despite the different adversarial learning processes of the two state-of-the-art GAN methods, their performances in neighborhood preserving are almost the same. In contrast, SRGAN achieves clearly smaller \(Q_{l_{1}}\) and \(Q_{l_{2}}\) distances: it reduces them by more than 18% on CitNet, by 5% on HepTh, and by almost a half on Cora.

Fig. 3 Results of \(Q_{l_{1}}\) distance on the three datasets in neighborhood preserving over the DDA methods

Fig. 4 Results of \(Q_{l_{2}}\) distance on the three datasets in neighborhood preserving over the DDA methods

The state-of-the-art GAN methods only consider the adversarial loss in modality transformation. They only try to make the generated data indistinguishable in the graph-domain space and neglect to preserve the structural relationships in networks. In contrast, as the experimental results show, SRGAN narrows the difference between the transformed data and the original data in their relationships with neighbors. We think this is mainly due to the framework of SRGAN, where the cycle learning process with reconstruction and construction losses provides the cross-modal mapping knowledge that helps diminish the \(Q_{l_{1}}\) and \(Q_{l_{2}}\) distances.

4.5 Relation inference from text-domain information

To validate the effectiveness of all methods in translating social relations, we conduct link prediction experiments based on text-domain information. We construct positive samples labeled with “1” by selecting all edges (v,u)∈E, and negative ones labeled with “0” by randomly generating node pairs (v,u)∉E. To evaluate the methods, we train classifiers on the original network structures using l2-regularized logistic regression [63, 64] as implemented in scikit-learn (footnote 6). We aim to predict whether there exists a link between two given nodes; thus, following [65], the metrics adopted for performance evaluation are Micro-F1 and Macro-F1. Micro-F1 sums up the individual true positives, false positives, and false negatives over the different classes and computes F1 from the totals, while Macro-F1 computes the F1 score (from precision and recall) of each class separately and takes their unweighted mean.
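The sample construction described above could be sketched as follows. Two choices here are assumptions rather than details given in the text: edges are treated as undirected, and the number of negative pairs is balanced with the number of positive ones. The function name `build_link_samples` is hypothetical.

```python
import random


def build_link_samples(nodes, edges, seed=0):
    """Return (v, u, label) triples: label 1 for edges, 0 for sampled non-edges."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}  # assumption: undirected edges
    positives = [(v, u, 1) for v, u in edges]

    # Never ask for more negatives than there are non-adjacent pairs.
    n = len(nodes)
    max_negatives = n * (n - 1) // 2 - len(edge_set)
    target = min(len(edge_set), max_negatives)

    negatives, seen = [], set()
    while len(negatives) < target:
        v, u = rng.sample(nodes, 2)  # two distinct nodes
        pair = frozenset((v, u))
        if pair in edge_set or pair in seen:
            continue  # skip real edges and duplicates
        seen.add(pair)
        negatives.append((v, u, 0))
    return positives + negatives


toy_nodes = list(range(6))
toy_edges = [(0, 1), (1, 2), (2, 3)]
samples = build_link_samples(toy_nodes, toy_edges)
```

The resulting triples can then be converted to edge representations and passed to any binary classifier, such as the logistic regression used in the experiments.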

Intuitively, when the text-based networks have no explicit structure, it is natural to measure the similarity of two nodes by their textual information. Therefore, we employ two commonly used text similarity methods, the Jaccard similarity coefficient and cosine similarity, to provide baseline results for relation inference based on texts. The threshold is set to 0.5, which is the same as in the l2-regularized logistic regression classifiers. If the Jaccard similarity coefficient or the cosine similarity score between two texts is larger than 0.5, then we infer that there exists a link between the corresponding nodes. Besides, due to the scalability of DeepWalk, node2vec, and AIDW, we also investigate the effectiveness of these three state-of-the-art NRL methods in the situation of missing structures. Tables 2, 3, and 4 show the performances of all methods, and the numbers in bold represent the best results.
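The thresholding rule for the text similarity baselines can be illustrated with the Jaccard similarity coefficient. The whitespace tokenization below is an assumption made for the sake of a short example; the preprocessing used in the experiments is not specified here.

```python
def tokenize(text):
    """Lowercase and split on whitespace (assumed preprocessing)."""
    return text.lower().split()


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity coefficient of two token collections."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty texts
    return len(a & b) / len(a | b)


def infer_link(score, threshold=0.5):
    """Predict a link (1) only if the score is strictly larger than the threshold."""
    return 1 if score > threshold else 0


text_u = "graph neural network embedding"
text_v = "graph neural network"
text_w = "image video retrieval"

score_uv = jaccard(tokenize(text_u), tokenize(text_v))  # 3 shared of 4 distinct words
score_uw = jaccard(tokenize(text_u), tokenize(text_w))  # no shared words
link_uv = infer_link(score_uv)
link_uw = infer_link(score_uw)
```

Because the rule uses a strict inequality, a pair whose score equals exactly 0.5 is predicted as unlinked.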

Table 2 Micro-F1 and Macro-F1 scores on CitNet dataset in link prediction (α=0.1,β=0.1)
Table 3 Micro-F1 and Macro-F1 scores on Cora dataset in link prediction (α=0.3,β=0.1)
Table 4 Micro-F1 and Macro-F1 scores on HepTh dataset in link prediction (α=0.9,β=0.1)
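For reference, the two scores reported in Tables 2, 3, and 4 can be computed from per-class counts as in the following sketch. The toy labels are invented for illustration and are not experimental data; the example shows what the two metrics give for a classifier that predicts every pair as negative on a balanced test set.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, labels=(0, 1)):
    """Return (Micro-F1, Macro-F1) over the given classes."""
    counts = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts.append((tp, fp, fn))
    # Micro: pool the counts of all classes, then compute a single F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts)
    )
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / len(counts)
    return micro, macro


y_true = [1, 1, 1, 0, 0, 0]
all_negative = [0, 0, 0, 0, 0, 0]
micro, macro = micro_macro_f1(y_true, all_negative)
```

In this toy case Micro-F1 is 0.5 while Macro-F1 is 1/3, because the positive class contributes an F1 of 0 to the unweighted mean.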

Table 2 demonstrates the superiority of SRGAN over the other seven methods. First, we can see that the DDA methods achieve a significant improvement in relation inference based on textual information. The best Micro-F1 and Macro-F1 scores on the CitNet dataset, produced by SRGAN, are 0.8340 and 0.8265, respectively. Meanwhile, we find that the standard GAN, LSGAN, and SRGAN perform stably when the percentage of training edges varies. Second, we can conclude that the structure-based NRL methods (DeepWalk, node2vec, and AIDW) are unsuitable for the situation of missing structures. To our surprise, the edge representations generated by DeepWalk and node2vec make the classifiers predict all potential links as negative regardless of the percentage of training edges. We think the reason is that the online training process barely learns meaningful structural information for the data in the testing sets. Therefore, the testing nodes might be mapped to positions where they cannot hold the proper structural relationships. Though AIDW performs slightly better than DeepWalk and node2vec, it still suffers from unknown structures when generating new embeddings. Third, the naive idea of inferring potential relations by measuring the similarity of textual information does not achieve the expected results. Cosine similarity performs the worst, and Jaccard is only slightly better than DeepWalk, node2vec, and AIDW.

As shown in Table 3, SRGAN improves both the Micro-F1 and Macro-F1 scores by more than 5% compared with the other two DDA methods. The standard GAN and LSGAN are practically indistinguishable on the Cora dataset. We find that DeepWalk and node2vec fail to produce valid results because they again make the classifiers predict all relations as nonexistent. AIDW performs much better than the other two NRL methods, but it still cannot predict meaningful relationships even though the node representations are enhanced. Jaccard is better than cosine similarity on the Cora dataset, but it still cannot infer convincing social relations. SRGAN outperforms all baseline methods and achieves the highest Micro-F1 (0.7905) and Macro-F1 (0.7887) scores with 90% of training edges. Again, the DDA methods perform steadily when the percentage of training edges varies from 10 to 90%.

Table 4 shows the results on the HepTh dataset. SRGAN again produces the best Micro-F1 (0.8004) and Macro-F1 (0.7973) scores, given 50% of training edges. LSGAN turns out to be competitive since, in most cases, it performs better than the standard GAN. DeepWalk and node2vec seem to be unfit for inferring social relations in the network-structureless situation. We think the two structure-based methods cannot benefit from the pretrained embedding models, which leads them to generate meaningless edge representations. As a consequence, all edges in the testing sets are predicted as invalid connections. Though AIDW makes positive predictions, few of them are correct. The Jaccard similarity coefficient and cosine similarity are still not competitive with our proposed approach.

4.6 Sensitivity of hyper-parameters

We evaluate the sensitivity of hyper-parameters α and β in the space of {0.1,0.3,0.5,0.7,0.9} in this subsection. Figures 5 and 6 present the results.

Fig. 5 Micro-F1 scores on the three datasets in hyper-parameter sensitivity evaluation

Fig. 6 Macro-F1 scores on the three datasets in hyper-parameter sensitivity evaluation

As α and β increase, the performance of SRGAN gradually approaches that of DeepWalk and node2vec on the CitNet and Cora datasets. It thus seems inappropriate to set the hyper-parameters too large in SRGAN. When α=0.1 and β=0.1, SRGAN achieves its best results on CitNet. For the Cora dataset, SRGAN produces the highest results when α increases to 0.3. Unlike on these two datasets, both metrics on HepTh fluctuate slightly around 0.77. SRGAN remains competitive with the other methods, and its best results are achieved at (α=0.9,β=0.1). Hence, we think that in most cases it is more effective for SRGAN to use a relatively small β. Depending on the application, α still needs to be fine-tuned.
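The grid search over α and β amounts to evaluating all 25 combinations of the candidate values and keeping the best one, as sketched below. The function `toy_evaluate` is a hypothetical stand-in for training SRGAN with a given (α, β) and returning a validation score; its shape is invented and does not reproduce the reported results.

```python
from itertools import product

GRID = (0.1, 0.3, 0.5, 0.7, 0.9)


def grid_search(evaluate, grid=GRID):
    """Return (best_score, alpha, beta) over all pairs drawn from the grid."""
    best = None
    for alpha, beta in product(grid, repeat=2):
        score = evaluate(alpha, beta)
        if best is None or score > best[0]:
            best = (score, alpha, beta)
    return best


def toy_evaluate(alpha, beta):
    """Hypothetical score surface peaking at alpha=0.3, beta=0.1."""
    return 1.0 - (alpha - 0.3) ** 2 - (beta - 0.1) ** 2


best_score, best_alpha, best_beta = grid_search(toy_evaluate)
```

With a real evaluation function, each call would involve a full training run, so the cost of the search is 25 times that of a single run.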

4.7 Efficiency of SRGAN

To demonstrate the efficiency of SRGAN, we measure its speed during the training stage. For each dataset, the batch size is set to 64. Two graphics processing units (GPUs, NVIDIA Tesla K80; footnote 7) are deployed to accelerate model training.

Table 5 shows the efficiency of SRGAN on the three datasets, where \(T_{all}\) denotes the total time for training SRGAN per iteration (epoch), \(T_{1}\) denotes the time SRGAN takes to complete the first batch loop, and \(T_{avg}\) denotes the average time per batch for the remaining batches. Owing to the GPU accelerators, training SRGAN on the three datasets is fast. \(T_{avg}\) is only around 0.05 s for all datasets. Even on the large CitNet network, we can still finish training SRGAN within 87 s per iteration.

Table 5 Efficiency of SRGAN on the three datasets
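The three timing quantities could be measured as in the following sketch, which times a generic training step per batch. Here the total time is taken as the sum of the per-batch times, which is an assumption: the reported totals may include additional per-epoch overhead. The names `make_batches`, `time_epoch`, and `dummy_step` are hypothetical.

```python
import time


def make_batches(items, batch_size=64):
    """Split items into consecutive batches of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def time_epoch(batches, train_step):
    """Return (t_all, t_first, t_avg) in seconds for one pass over the batches."""
    durations = []
    for batch in batches:
        start = time.perf_counter()
        train_step(batch)
        durations.append(time.perf_counter() - start)
    if not durations:
        raise ValueError("no batches to time")
    t_all = sum(durations)    # total time of the epoch
    t_first = durations[0]    # first batch, typically includes warm-up cost
    rest = durations[1:]
    t_avg = sum(rest) / len(rest) if rest else 0.0  # average of remaining batches
    return t_all, t_first, t_avg


def dummy_step(batch):
    """Placeholder for one SRGAN update on a batch."""
    return sum(x * x for x in batch)


batches = make_batches(list(range(200)), batch_size=64)
t_all, t_first, t_avg = time_epoch(batches, dummy_step)
```

Reporting the first batch separately is useful because it usually absorbs one-off initialization costs that would otherwise distort the per-batch average.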

4.8 Discussions

Overall, SRGAN is superior to the state-of-the-art DDA methods (standard GAN and LSGAN), the structure-based NRL methods (DeepWalk, node2vec, and AIDW), and the text similarity methods (Jaccard similarity coefficient and cosine similarity). SRGAN makes the transformed data preserve relationships in its neighborhood that are much closer to those of the original data. Meanwhile, the relation inference results produced by SRGAN show that even when some of the structural information is missing, we can still make valid predictions from text-domain information. Thus, SRGAN outperforms the baseline methods in making the transformed data reflect the sociability of the graph-structured data, that is, its tendency to associate in or to form social connections. The standard GAN and LSGAN only consider adversarial losses in modality transformation and therefore cannot preserve the network structures well. It is difficult for DeepWalk and node2vec to learn meaningful embeddings without explicit structures, since the online learning process cannot transfer valid information about the pretrained network relationships to unseen nodes. AIDW enhances the robustness of node representations, but it still struggles in network-structureless situations. The Jaccard similarity coefficient and cosine similarity methods infer relationships based only on text resources; our experiments show that complex real-world social relations cannot be inferred simply from the similarity of text data.

Compared with the aforementioned baselines, the effectiveness of SRGAN can be explained as follows.

  1. In our cross-modal mapping framework, not only do we use the adversarial losses to deceive domain discriminators, but also reconstruction and construction losses are applied to learn textual and topological styles, which depict the structure-aware relationships with the network homophily;

  2. SRGAN incorporates knowledge from the text domain, which remedies the network-structureless situation where structure-based NRL methods cannot perform well;

  3. Unlike the text similarity methods that only consider text resources, in social relation translation, SRGAN also takes advantage of the original graph data. The cross-modal mapping knowledge we learn bridges the text-modal data and graph-modal data, which helps infer meaningful social relations.

However, this task remains challenging and demands more research effort. In the experiments, we find that the generated network data cannot perfectly imitate the “manners” of the real data in the original network space (i.e., the \(Q_{l_{1}}\) and \(Q_{l_{2}}\) distances still have room to be reduced). We think SRGAN can preserve “relative” relationships (social connections) based on textual information, but it is challenging to locate the “absolute” position of a node in the graph-domain space. The reason may be that the networks are generated from diverse social information, of which the utilized textual information might be only one of the key components that lead to the construction of some of the topologies or interactions. Therefore, it is hard to accurately locate an unseen node in the original network space from such textual information alone.

5 Conclusion and future work

In this paper, we propose social relation GAN (SRGAN), which aims to remedy the difficulty that most existing structure-based NRL techniques have in dealing with text-based network-structureless problems. The cross-modal mapping framework bridges the gap between the graph-modal data and text-modal data, which helps learn meaningful relations from the text-domain information in networked systems. Experimental results on three text-based network benchmarks show that SRGAN can translate more realistic social relations than the baselines.

In future work, we will consider incorporating other multimedia data, such as images and videos, to analyze such network-structureless situations. Also, we believe it is possible to generate meaningful text-based profiles from the graph-modal data, which could provide more information for human-centered applications such as recommendation and detection.