Introduction

Learning from non-Euclidean data [1] has gained considerable scientific attention in recent years. Among such data, learning from network-structured data is a challenging direction with diverse applications in fields like recommender systems [2], computational social systems [3], text mining [4], service-oriented and content delivery networks [5, 6], and systems biology [7]. With the success of deep learning in various domains, deep methods became prominent and fruitful in learning network representations, which eventually led to the development of a machine learning subdomain named network embedding or network representation learning (NRL) [8,9,10,11]. Initial works in this domain were based on unsupervised learning using skip-gram neural architectures, followed by deep neural networks. Another research direction was to use conventional convolutional neural network architectures to learn network representations. Applying the traditional convolution operation to graphs is found to generate sub-optimal results because the network structure is highly irregular. Substantial developments in the field of NRL occurred with the proposal of graph convolutions, which are effective even when the input is highly irregular. A plethora of works [12,13,14,15] has been proposed based on graph convolutions, which can be mainly classified into spectral and spatial graph convolution-based methods. Among these methods, our work particularly focuses on the graph convolutional network (GCN) [13] because of its wide popularity and effectiveness.

Fig. 1 Receptive field of node A

The basic principle of GCN is to learn the representation of each node by aggregating the features of its first-order neighbours through a parameterized learning mechanism. The receptive field of each node includes the immediate one-hop neighbours, which is shown for node A in Fig. 1. GCN has proved to be very successful in many state-of-the-art network mining tasks like node classification and link prediction. Several variants of GCN have already been proposed which make the model more robust and scalable.

The basic GCN model is designed to work only with static networks, where the time-varying nature of the network is not considered. However, in the real world, most networks are either evolving in nature or carry temporal information on their edges. We call them either a dynamic network, represented as network snapshots where the importance is given to the evolving nature, or a temporal network, represented as a network with time-stamped edges where the importance is given to the change in connectivity patterns w.r.t. time. In this work, we focus on the temporal aspect, where the input graph is represented with edges that carry temporal information. Telephone call networks, email communication networks, and disease spread networks are some typical examples of temporal networks. The notion of a temporal edge is highly sensitive when we model virus spread as a network, because the temporal ordering of contacts is an important factor to be considered while modelling the spread of disease from one person to another. The temporal relationships between the nodes are an important property to be preserved while embedding the nodes of a temporal network into the vector space.

Some works [16,17,18] focused on network embedding from dynamic networks, which take network snapshots at consecutive time-steps as input. However, network embedding from temporal networks has received less attention, apart from a few works [19] which used unsupervised approaches based on random walks and skip-gram neural architectures. As the graph convolutional network is the most prominent and effective method for network embedding, we aim to develop a method based on GCN which can learn node embeddings by considering the temporal information present in the network. A temporal network and the receptive field of node A (shown in a circle) are given in Fig. 2. For node A, the receptive field includes the one-hop neighbours (B, C, D, E) along with the nodes (H, I) which are in the temporal neighbourhood of A (a time-ordered path exists). Nodes F and G are not treated as temporal neighbours because they cannot be reached using a time-respecting walk from A. Considering this temporal receptive field while learning node embeddings can generate more useful representations, which is the primary motivation of this research.

Fig. 2 A temporal network

In GCN, the feature aggregation at each node assumes that all neighbours have the same importance, which may reduce the model capacity. The graph attention network (GAT) [20] showed that, by assigning attention to neighbours that can be learned by end-to-end training, we may build a more robust model. In this work, we follow the attention mechanism suggested by GAT so as to include the hypothesis that different temporal neighbours may contribute differently during the aggregation process. Both GCN and GAT consider the edge distribution of the network to be static and are not well suited for representation learning from temporal networks whose edge distributions vary over time. Moreover, they focus on preserving the first-order neighbourhood while generating embeddings. In the case of temporal networks, temporal order proximity is another important property to be preserved while generating node embeddings.

The main contributions of the paper are as follows.

The work addresses the problem of generating node embeddings from temporal networks.

The work provides a methodology to incorporate temporal information into a graph attention network for generating time-aware node embeddings.

A graph autoencoder based on the proposed method is designed which can perform link prediction on real-world temporal networks.

To the best of our knowledge, this is the first work which applies GCN concepts to temporal network data represented as edge streams.

The rest of the paper is organized as follows. “Related works” and “Definitions and preliminaries” cover the related works and preliminary concepts, respectively. “Methodology” presents the proposed methodology. “Experimental setup” and “Result and analysis” discuss the experimental setup and the results, respectively. Finally, “Conclusion” presents the conclusion and future works.

Related works

Machine learning has made tremendous improvements in various areas like speech recognition [21, 22], object detection [23, 24], and text mining [25, 26]. With the rise of large-scale social networks, a substantial amount of research has been conducted on machine learning with network- or graph-structured data. As feature (representation) learning is an important task in the machine learning pipeline, methods for learning good representations of nodes in a network gained importance. The emergence of deep learning accelerated the growth of this research area, and various methods were proposed for network representation learning based on neural architectures. These include works based on skip-gram architectures [27, 28], deep neural networks [29, 30], and graph neural networks [31]. In this work, we particularly focus on graph neural networks (GNN), which have their roots in the field of signal processing on graphs [32].

The basic GNN aimed to extend traditional neural network concepts to work with data represented in the network domain. Among the variants of GNN, works based on graph convolutions [33, 34] gained wide popularity. They can be mainly classified into spectral and spatial graph convolution-based methods [15]. Spectral methods include graph filtering-based approaches [35] along with some methods to reduce the computational complexity [12]. Spatial methods [14, 37] perform feature aggregation using the local neighbourhood of every node, which is relatively simpler compared to spectral methods. The authors of GCN [13] provided a simplified method to compute the spectral graph convolution through its first-order approximation, which can be considered the most practical approach to the problem. GCN has proved to be very effective in various domains like text classification [36], recommender systems [37], relational data modeling [38], and image recognition [39], along with various other scientific areas [40, 41]. FastGCN [42] is a modification of GCN which attempts to reduce the training time. The graph attention network (GAT) [20] is an enhancement of GCN which uses an additional attention layer to learn the importance of node neighbourhoods during feature aggregation. Another direction of research on GCN was to efficiently design convolutions that can reach higher order neighbours [43, 44]. Researchers developed different variants of GCN which can work with more complex settings like heterogeneous networks [45, 46], signed networks [47], and hypergraphs [48].

All the works discussed above focus on static networks where the nodes and edges do not change over time. Some work has already been done on time-varying network embedding, most of it aimed at embedding network snapshots that evolve with time. The Deep Embedding Method for Dynamic Graphs (DynGEM) [17] uses a stacked denoising autoencoder that can incrementally learn the temporal dynamics of a dynamic network. Tempnode2vec [49] generates PPMI matrices from individual network snapshots, factorizes the PPMI matrices, and optimizes a joint loss function to align the node embeddings and capture temporal stability. Dynnode2vec [50] extends the skip-gram architecture of node2vec so as to work with dynamic network snapshots. DyRep [51] considers both topological evolution and temporal interactions, and aims to develop embeddings which encode both structural and temporal information. EvolveGCN [18] extends GCN to dynamic networks by modelling the evolution of the GCN parameters using a recurrent neural network. Combining graph convolution layers with LSTM layers [52] is another direction for generating dynamic node embeddings. A brief survey on modelling dynamic networks using dynamic graph neural networks can be found in [53]. On the other hand, temporal networks [54, 55], whose edge connectivity varies over continuous time, have been less studied from a network embedding perspective. One state-of-the-art work in this direction is continuous-time dynamic network embeddings (CTDNE) [19, 56], a framework that adapts random walk and skip-gram-based approaches like DeepWalk to temporal networks. CTDNE optimizes a skip-gram objective so that nodes that are close in the temporal walks are mapped close together in the vector space. In the proposed work, the concept of temporal random walk [56, 57] is used to capture the temporal information of the network, and positive pointwise mutual information (PPMI) [58] is used to compute the temporal proximity between vertex pairs.

Various approaches [59,60,61] were proposed to perform link prediction on dynamic networks. Recent studies [18, 19, 62, 63] show that link prediction performance can be improved to a great extent using network embedding methods. A graph autoencoder [64] designed using GCN as the encoder is particularly effective in reconstructing the original network and thereby predicting the missing links. A summary of some related works is presented in Table 1. In the context of the temporal networks discussed in this work, the link prediction task is to predict the links that may occur as edge streams at a later point in time.

Table 1 Summary of related works

Definitions and preliminaries

In this section, we provide the various definitions used in this work along with the problem definition of temporal network embedding. We also review the preliminary design of GCN and GAT.

Definitions

Temporal network  [54] It is a graph \(G = (V, E_T, T)\), where V is a set of vertices, \(E_T\) is the set of time-stamped edges between vertices in V, and T is the set of possible time-steps. Each instance of the network can be represented as the contact between the vertices as a function of time: a set of triplets \((v_{i}, v_{j}, t)\) where t is the time of interaction between vertex \(v_{i}\) and \(v_{j}\), with \(t \in T\). At the finest granularity level, each edge may be labeled with a distinct time-stamp. Also, there can be multiple edges between nodes representing the interaction between vertices at different time-stamps. For example, an email communication may occur between two people at different time-stamps.
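For illustration, a minimal sketch of how such an edge stream of \((v_{i}, v_{j}, t)\) triplets can be held in memory using NetworkX (the graph-processing package used in our experiments); the toy edges and attribute name `time` are hypothetical, not part of the datasets used later.

```python
import networkx as nx

# A toy temporal network stored as an edge stream of (v_i, v_j, t) triplets.
# Multiple edges between the same pair are allowed (e.g., repeated emails).
edge_stream = [
    ("A", "B", 1), ("B", "C", 2), ("A", "B", 4), ("C", "D", 5),
]

# A MultiGraph with a 'time' attribute keeps every time-stamped interaction.
G = nx.MultiGraph()
G.add_edges_from((u, v, {"time": t}) for u, v, t in edge_stream)

print(G.number_of_nodes(), G.number_of_edges())  # 4 nodes, 4 temporal edges
```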

Temporal walk [56]: A temporal walk exists from node \(v_i\) to \(v_j\) if there exists a stream of edges \(E = (v_{i}, v_{k}, t_1), (v_{k}, v_{l}, t_2), \ldots ,(v_{n}, v_j, t_n)\) from \(v_i\) to \(v_j\) such that \(t_1 \le t_2 \le \cdots \le t_n\). Informally, if we have two edges \((v_{1}, v_{2}, t_1)\) and \((v_{2}, v_{3}, t_2)\), a temporal walk starting at \(v_1\) can reach \(v_3\) only if \(t_1 \le t_2\). We say that two vertices u and v are temporally close if there exists a time-respecting walk from u to v. We may also define the temporal order proximity between any two vertices \(v_i\) and \(v_j\) as the length of the shortest temporal walk between \(v_i\) and \(v_j\).

Therefore, in the case of temporal networks, a walk sequence is valid only if the edges are traversed in increasing order of interaction time. If each edge represents an interaction (e.g., a phone call) between two objects, then a temporal random walk represents an optimal path for information transfer through the temporal network. For example, suppose we have two phone calls \(e_i= (v_1,v_2, t_1)\) from \(v_1\) to \(v_2\) at time \(t_1\) and \(e_j = (v_2,v_3,t_2)\) from \(v_2\) to \(v_3\) at time \(t_2\); then, if \(t_1 \le t_2\), the call \(e_j = (v_2,v_3)\) may reflect the information received during the call \(e_i = (v_1,v_2)\). However, if \(t_1 > t_2\), the call \(e_j = (v_2,v_3)\) cannot contain any information communicated between \(v_1\) and \(v_2\).
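As a small sketch of the definition above (the helper name `is_temporal_walk` is hypothetical), the following function checks whether a sequence of time-stamped edges chains together and respects temporal order:

```python
def is_temporal_walk(edges):
    """edges: list of (v_i, v_j, t) triplets in traversal order.

    Returns True if consecutive edges share an endpoint and time-stamps
    are non-decreasing, i.e., the walk is time-respecting.
    """
    for (u1, v1, t1), (u2, v2, t2) in zip(edges, edges[1:]):
        if v1 != u2 or t2 < t1:
            return False
    return True

# The phone-call example: information can flow v1 -> v2 -> v3 only if t1 <= t2.
print(is_temporal_walk([("v1", "v2", 3), ("v2", "v3", 5)]))  # True
print(is_temporal_walk([("v1", "v2", 5), ("v2", "v3", 3)]))  # False
```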

Temporal network embedding: Given a temporal network \(G = (V, E_T, T)\), the task is to learn a transformation function \(f:V_i \rightarrow K_i \in R^d\), where \(d \ll \vert V \vert \), which maps the nodes of the network to a low-dimensional space while preserving the network structure and the temporal order proximity of nodes in the network. Two nodes are mapped close together in the vector space if they are topologically close in the network and there exists a high temporal closeness between them.

Graph convolutional network (GCN) [13]

A convolutional neural network is a widely used concept for learning good feature representations from data that can be represented in a Euclidean space. Networks are inherently irregular or non-Euclidean, and the traditional convolution operation as applied to images is not directly applicable to networks. The concept of graph convolution evolved from the field of graph signal processing, where the domain is classified into spectral and spatial graph convolutions. In the spectral domain, the node features or attributes of the graph are considered as graph signals, and the Chebyshev polynomials of the diagonal matrix of Laplacian eigenvalues are considered as the graph kernel. Spectral convolutions are defined as the product of a graph signal with a kernel. The Chebyshev kernel uses the sum of all Chebyshev polynomial kernels applied to the diagonal matrix of scaled Laplacian eigenvalues for the first-order to k-order (largest order) neighbourhood. GCN can be considered a first-order approximation of the spectral graph convolution defined using Chebyshev polynomials. Given a graph represented as an adjacency matrix A with n nodes and k features per node, the goal of GCN is to learn a function which takes as input (i) an n x k feature matrix H and (ii) an adjacency matrix A, and produces an output Z, an n x d matrix, where d is the number of dimensions per node. GCN uses the layer-wise propagation rule:

$$\begin{aligned} H_{(l+1)} = \sigma (\hat{D}^{-\frac{1}{2}}\hat{A} \hat{D}^{-\frac{1}{2}} H_{(l)} W_{(l)}), \end{aligned}$$
(1)

where \(W_l\) denotes the weight matrix of the \(l^{th}\) layer, \(\hat{A}=A+I\), and \(\hat{D}\) is the diagonal node degree matrix of \(\hat{A}\). Initially, the node feature matrix F can be considered as the embedding matrix, i.e., \(H_0 =F\). At each convolutional layer l, the embedding matrix gets updated by following three steps: feature propagation, a linear transformation, and a non-linear activation.

Fig. 3 Flow diagram for PPMI matrix generation

Fig. 4 Temporal hops w.r.t. node A

During feature propagation, the new features of each node become the sum of the features from the node’s first-order neighbourhood. This is followed by multiplying the result with the weight matrix, which linearly transforms the feature representations to another latent space. The last step is to apply a non-linear activation function. The results of the convolutions may be fed into a softmax layer, and the weights can be learned using application-specific tasks like link prediction or node classification.
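A minimal PyTorch sketch of the propagation rule in Eq. (1), assuming ReLU as the non-linearity \(\sigma\); the class name, tensor sizes, and dense-matrix formulation are illustrative choices, not the reference GCN implementation.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H_{l+1} = sigma(D^-1/2 (A+I) D^-1/2 H_l W_l), Eq. (1)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # W_l

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0))            # add self-loops
        D_inv_sqrt = torch.diag(A_hat.sum(1).pow(-0.5))
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalisation
        return torch.relu(A_norm @ self.linear(H))  # propagate + transform + activate

# Toy usage: 4 nodes, 3 input features, 2 output dimensions.
A = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
H = torch.randn(4, 3)
print(GCNLayer(3, 2)(H, A).shape)  # torch.Size([4, 2])
```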

Graph attention network (GAT) [20]

GCN assigns equal importance to all the nodes in the same neighbourhood and, therefore, has limitations in model capacity. GAT is designed to be a more robust model which uses an attention mechanism to assign different weights to nodes in the same neighbourhood. Moreover, the attention weights can be learned using end-to-end training, which can improve the model performance. In addition to the steps followed in GCN, the core of GAT is an attention layer which learns self-attention weights that indicate the importance of node j's features to node i. GAT also takes as input an n x k feature matrix F and an adjacency matrix A, and produces Z, an n x d embedding matrix. Like GCN, the initial step is to apply a linear transformation parameterized by a weight matrix W over all nodes to generate a high-level latent representation. The next step is to perform self-attention, parameterized by a weight vector \(\overrightarrow{a}\), over all nodes to learn an attention coefficient between every pair of vertices. Given the feature matrix H = {\(h_1,h_2,\ldots ,h_n\)} (initially \(H=F\)), where each row \(h_i\) represents the feature vector of node i, the attention coefficient \(E_{ij}\) between every node pair i and j can be computed as:

$$\begin{aligned} E_{ij}= \overrightarrow{a}(Wh_i,Wh_j). \end{aligned}$$
(2)

GAT only considers \(E_{ij}\) for those nodes \( j \in N_i\), where \(N_i\) represents the first-order neighbourhood of node i in the graph. GAT uses a leaky ReLU function to introduce non-linearity and a softmax function to normalize the attention coefficients. The coefficients thus computed can be represented as:

$$\begin{aligned} E_{ij}=\frac{\exp (\mathrm{LeakyReLU}(\overrightarrow{a}^T(Wh_i||Wh_j)))}{ \sum _{l \in N_{i}}\exp (\mathrm{LeakyReLU}(\overrightarrow{a}^T(Wh_i||Wh_l)))}, \end{aligned}$$
(3)

where || represents the concatenation operation. The matrix with entries \(E_{ij}\) constitutes the attention matrix. Furthermore, the link information in the adjacency matrix is replaced by the learned attention coefficients from the attention matrix, and the features from the first-order neighbourhood are aggregated. A typical feature aggregation at node i can be represented as:

$$\begin{aligned} h_{i}'= \sigma \left( \sum _{j \in N_{i}}E_{ij}W h_j\right) . \end{aligned}$$
(4)
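A compact single-head sketch of Eqs. (2)–(4); the dense pairwise scoring, the negative-slope value 0.2, and the use of sigmoid for \(\sigma\) are illustrative assumptions rather than the reference GAT code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention following Eqs. (2)-(4)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention vector a

    def forward(self, H, A):
        Wh = self.W(H)                                    # linear transform
        n = Wh.size(0)
        # Concatenate Wh_i || Wh_j for every ordered pair (i, j).
        pairs = torch.cat([Wh.repeat_interleave(n, 0), Wh.repeat(n, 1)], dim=1)
        E = F.leaky_relu(self.a(pairs), 0.2).view(n, n)   # raw coefficients, Eq. (2)
        E = E.masked_fill(A == 0, float("-inf"))          # keep first-order neighbours
        alpha = F.softmax(E, dim=1)                       # normalised coefficients, Eq. (3)
        return torch.sigmoid(alpha @ Wh)                  # weighted aggregation, Eq. (4)

A = torch.tensor([[1., 1, 0], [1, 1, 1], [0, 1, 1]])      # adjacency with self-loops
H = torch.randn(3, 4)
print(GATLayer(4, 2)(H, A).shape)  # torch.Size([3, 2])
```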

Methodology

In this section, we discuss a methodology to incorporate temporal information into a graph convolutional network so as to generate time-aware node embeddings from a temporal network.

Extracting temporal proximity information

The first step of the process is to extract the temporal neighbourhood information of every node in the network. Here, we propose a temporal walk and PPMI-based method to efficiently represent the temporal neighbourhood of each node. A flow diagram of the method is shown in Fig. 3. We perform temporal random walks starting at various nodes of the network to generate walk sequences. Furthermore, we compute the co-occurrence statistics of every pair of nodes to generate the pointwise mutual information (PMI) matrix. The negative entries of the PMI matrix are set to zero to form the positive pointwise mutual information (PPMI) matrix.

Fig. 5 A two-layer TempGAN architecture

Computing the temporal co-occurrence information between the nodes of a temporal network with a large number of discrete time-steps is computationally intensive. Here, we adopt a sampling-based technique where we approximate the co-occurrence information using the walk sequences generated from a temporal random walk. A temporal walk sequence is a sequence of nodes generated from a temporal walk, where l is the length of the walk. The walk is similar to the truncated random walk defined by [27], but it follows the temporal ordering given in the definition of a temporal walk. The walk length is a hyperparameter which influences the order to which the temporal neighbourhood of each node is captured. For example, from Fig. 4, it can be observed that the walk length l determines the temporal neighbourhood of node A.

To generate walk sequences, we follow the sampling strategy discussed in [56]. A sampling distribution based on time-steps is precomputed from the graph and is used for initial edge selection. We use a linear sampling distribution where the probability of selecting edge \(e \in E_T\) as the initial edge can be computed as:

$$\begin{aligned} P(e) = \frac{\gamma (e)}{\sum _{k \in E_{T}}\gamma (k)}; \end{aligned}$$
(5)

\(\gamma \) is a function that sorts the edges in ascending order of time and maps each edge to an index, with \(\gamma (e) = 1\) for the earliest edge e. After initial edge selection, the next step is to select a temporal neighbour which will be the next node in the random walk. Here also, we use a linear sampling distribution to provide a temporal bias to the next node selection, so that walks exhibiting a smaller in-between time for consecutive edges get more bias. If \(\beta \) is a function that sorts the temporal neighbours in decreasing order of time, the probability of selecting the temporal neighbour \(n \in T_t(u)\) as the next node after node u at time t in the walk can be represented as:

$$\begin{aligned} P(n) = \frac{\beta (n)}{\sum _{l \in T_{t}(u)}\beta (l)}. \end{aligned}$$
(6)
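A sketch of the walk generation described above under the linear biases of Eqs. (5) and (6); the function name, data layout, and toy stream are hypothetical, and the code assumes undirected time-stamped edges.

```python
import random
from collections import defaultdict

def temporal_random_walk(edge_stream, walk_len):
    """edge_stream: list of (u, v, t). Returns one time-respecting node sequence."""
    # Linear bias for the start edge, Eq. (5): gamma(e) = rank in ascending time,
    # so later edges receive proportionally higher probability.
    edges = sorted(edge_stream, key=lambda e: e[2])
    ranks = list(range(1, len(edges) + 1))
    u, v, t = random.choices(edges, weights=ranks, k=1)[0]
    walk = [u, v]

    # Index temporal neighbours: node -> list of (neighbour, time).
    nbrs = defaultdict(list)
    for a, b, s in edges:
        nbrs[a].append((b, s))
        nbrs[b].append((a, s))

    while len(walk) < walk_len:
        cand = [(w, s) for w, s in nbrs[v] if s >= t]      # time-respecting only
        if not cand:
            break
        # Linear bias of Eq. (6): beta ranks candidates in decreasing order of time,
        # so the neighbour with the smallest in-between time gets the largest weight.
        cand.sort(key=lambda x: x[1], reverse=True)
        weights = list(range(1, len(cand) + 1))
        v, t = random.choices(cand, weights=weights, k=1)[0]
        walk.append(v)
    return walk

stream = [("A", "B", 1), ("B", "C", 2), ("B", "D", 3), ("C", "E", 4)]
print(temporal_random_walk(stream, walk_len=4))
```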

Once the walk sequences are generated, the next step is to compute the pointwise mutual information between the nodes, which can be approximated from the co-occurrence statistics of nodes. The PMI is computed as:

$$\begin{aligned} \mathrm{PMI}(v_i,v_j) = \log ( \frac{P(v_i,v_j)}{P(v_i)P(v_j)} ). \end{aligned}$$
(7)

The negative entries in the PMI matrix are replaced by zero to form the PPMI matrix:

$$\begin{aligned} \mathrm{PPMI}(v_i,v_j)=\max (\mathrm{PMI}(v_i,v_j),0). \end{aligned}$$
(8)

Each entry in the PPMI matrix can be used to measure the temporal correlation between the vertex pair \(v_i\) and \(v_j\). The value will be high when there exist many time-respecting paths between \(v_i\) and \(v_j\) and will be low if they co-occur very few times in a temporal random walk.
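The following is a sketch of Eqs. (7) and (8), approximating the co-occurrence probabilities from the sampled walk sequences with a sliding context window; the window size, function name, and toy walks are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def ppmi_from_walks(walks, nodes, window=2):
    """Approximate PPMI(v_i, v_j) from node co-occurrences in walk sequences."""
    idx = {v: i for i, v in enumerate(nodes)}
    pair_counts, node_counts, total = Counter(), Counter(), 0
    for walk in walks:
        for i, u in enumerate(walk):
            for v in walk[i + 1:i + 1 + window]:        # nodes within the window
                pair_counts[(idx[u], idx[v])] += 1      # symmetric co-occurrence
                pair_counts[(idx[v], idx[u])] += 1
                node_counts[idx[u]] += 1
                node_counts[idx[v]] += 1
                total += 2

    M = np.zeros((len(nodes), len(nodes)))
    for (i, j), c in pair_counts.items():
        p_ij = c / total
        p_i, p_j = node_counts[i] / total, node_counts[j] / total
        M[i, j] = max(np.log(p_ij / (p_i * p_j)), 0.0)  # PPMI = max(PMI, 0), Eqs. (7)-(8)
    return M

walks = [["A", "B", "C"], ["B", "C", "E"], ["A", "C", "E"]]
print(ppmi_from_walks(walks, nodes=["A", "B", "C", "D", "E"]))
```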

TempGAN

The proposed neural architecture (TempGAN) takes as input the adjacency matrix A, the node feature matrix F, and the PPMI matrix M, and generates the node embedding matrix. A two-layer TempGAN is shown in Fig. 5. First, we discuss the theoretical intuition, followed by an example for better understanding. Like GAT, TempGAN follows two mechanisms, convolution and attention, at each hidden layer of the neural network. Given an initial n x k feature matrix H (\( H_0=F\)), where n is the number of nodes and k is the number of features per node, the first step is to apply a linear transformation parameterized by a weight matrix W to generate high-level features, which can be represented as:

$$\begin{aligned} H'= (WH). \end{aligned}$$
(9)

The next step is to apply self-attention over the nodes parameterized by a shared attention weight \(\overrightarrow{a}\) that can compute the attention coefficient matrix E as:

$$\begin{aligned} E= (\overrightarrow{a}H'). \end{aligned}$$
(10)

TempGAN also uses a leaky relu function to provide non-linearity and a softmax function to normalize the attention coefficients:

$$\begin{aligned} E_{ij}= \mathrm{softmax}_j(\mathrm{LeakyReLU}(E_{ij})). \end{aligned}$$
(11)

Each entry \(E_{ij}\) of E contains the attention coefficient for a pair of vertices. We need to consider \(E_{ij}\) only for nodes \(j \in T_i\), where \(T_i\) is the temporal neighbourhood of node i in the graph. The temporal neighbourhood of each node can be inferred from the PPMI matrix M. We can redefine the attention matrix as:

$$\begin{aligned} \hat{E_{ij}} = {\left\{ \begin{array}{ll} E_{ij},&{}\quad \text { if } M_{ij} + A_{i,j} > 0 \\ 0,&{}\quad \text {otherwise}; \end{array}\right. } \end{aligned}$$
(12)

i.e., for every node i, we need to consider the attention coefficient for those nodes which are in the temporal neighbourhood of i for further propagation and aggregation process. Finally, the propagation step can be represented as:

$$\begin{aligned} H_{l}= \sigma (\hat{E}WH_{l-1}). \end{aligned}$$
(13)

For each node i, the model propagates the transformed features from the temporal neighbourhood \(T_i\) to i, and the learned attention weights help to differentiate the temporal neighbours based on their importance in connectivity, i.e., a distant temporal neighbour will be given less importance during the aggregation process, which helps to build a more robust model.
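A minimal sketch of one TempGAN layer as described by Eqs. (9)–(13): attention coefficients are computed for all pairs, masked by the temporal neighbourhood inferred from \(M + A\), normalised, and used for aggregation. This is an illustrative reading of the equations (with leaky ReLU slope 0.1 as in the experimental settings and ReLU as the final \(\sigma\)), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TempGANLayer(nn.Module):
    """Attention restricted to temporal neighbours (M_ij + A_ij > 0), Eqs. (9)-(13)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)    # Eq. (9)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)     # Eq. (10)

    def forward(self, H, A, M):
        Wh = self.W(H)
        n = Wh.size(0)
        pairs = torch.cat([Wh.repeat_interleave(n, 0), Wh.repeat(n, 1)], dim=1)
        E = F.leaky_relu(self.a(pairs), 0.1).view(n, n)    # Eq. (11)
        mask = (M + A) > 0                                 # temporal neighbourhood, Eq. (12)
        E = E.masked_fill(~mask, float("-inf"))
        E_hat = F.softmax(E, dim=1)
        return torch.relu(E_hat @ Wh)                      # propagation, Eq. (13)

# Toy input: adjacency A (with self-loops), PPMI matrix M, node features H.
A = torch.tensor([[1., 1, 0], [1, 1, 1], [0, 1, 1]])
M = torch.tensor([[0., 0, 0.7], [0, 0, 0], [0.7, 0, 0]])   # nodes 0 and 2 temporally close
H = torch.randn(3, 4)
print(TempGANLayer(4, 2)(H, A, M).shape)  # torch.Size([3, 2])
```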

Fig. 6 Operation of TempGAN w.r.t. node A

The operations of TempGAN across two layers can be explained using Fig. 5. The coloured nodes denote temporal connections (other than first-order neighbours) to the source, which can be inferred from the PPMI matrix. The nodes within the temporal proximity of A are B, C, and E. Therefore, the features of the nodes A (self), B, C, and E are used in the convolution and attention process, and in generating the latent representation of node A. Similarly, in layer 2, for generating the representations of nodes A, B, C, and E, the features of the nodes in their temporal proximity are used. For better understanding, the convolution and attention operations performed at node A are shown in detail in Fig. 6. The features of the nodes A, B, C, and E are first fed into a linear transformation layer and the latent representations are learned. Further, they are passed to an attention and softmax layer to learn the attention coefficients. Finally, the transformed features from the temporal neighbours A, B, C, and E, parameterized by the attention coefficients, are aggregated to learn the latent representation of node A. The same process happens for all the nodes in the graph.

Application

Algorithm 1 Pseudocode of the TempGAN autoencoder for link prediction

Various network mining problems are used in the literature to evaluate the quality of embeddings generated using representation learning methods. In this work, we use the link prediction problem [66] on temporal networks as the benchmark application. The task is to predict the possibility of link existence between nodes at future time intervals, given the existing links between nodes at known time intervals. We follow a variational graph autoencoder architecture to conduct experiments on the link prediction problem. We use the TempGAN architecture as the encoding layers and a simple inner product operator as the decoding layer. The flow diagram of the TempGAN autoencoder architecture for link prediction is shown in Fig. 7, and the pseudocode of the implementation is shown as Algorithm 1. Here, we use a two-layer TempGAN which takes the temporal network as input and generates the mean and log variance w.r.t. every node. The distribution thus generated will be close to N(0, 1). A random sample embedding Z can be generated from the distribution using the reparameterization trick, which can be represented as:

$$\begin{aligned} Z= \mu + \sigma * \epsilon , \end{aligned}$$
(14)

where \(\epsilon \sim N(0,1)\). Furthermore, we can reconstruct the graph information using an inner product decoder which is represented as:

$$\begin{aligned} \hat{A}= \sigma (ZZ^T), \end{aligned}$$
(15)

where \(\sigma \) is the logistic sigmoid function.
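A short sketch of the encode/decode step in Eqs. (14) and (15): the TempGAN encoder produces a mean and log-variance per node, a sample Z is drawn with the reparameterization trick, and the inner-product decoder reconstructs edge probabilities. The helper names and toy statistics are placeholders standing in for the encoder output.

```python
import torch

def reparameterize(mu, logvar):
    """Z = mu + sigma * eps, with eps ~ N(0, 1) (Eq. 14)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def decode(Z):
    """A_hat = sigmoid(Z Z^T) (Eq. 15): link probability for each node pair."""
    return torch.sigmoid(Z @ Z.t())

# Toy usage: pretend the two TempGAN encoder layers produced these statistics.
mu, logvar = torch.randn(5, 8), torch.randn(5, 8)
Z = reparameterize(mu, logvar)
A_hat = decode(Z)
print(A_hat.shape)  # torch.Size([5, 5]), entries in (0, 1)
```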

Fig. 7 TempGAN autoencoder for link prediction

Experimental setup

In this section, we demonstrate the effectiveness of the proposed system by conducting link prediction experiments on real-world temporal network datasets. The area under the receiver-operating characteristic curve (AUC) and average precision (AP) are the measures used for evaluation. The results are compared with those of the baseline methods. All experiments are conducted on a machine with the Ubuntu 18.04 operating system, 16 GB RAM, a hexa-core processor with 3.2 GHz clock speed, and a GeForce GTX 1050 Ti GPU. We used Python packages for the system implementation, including NetworkX for graph processing, and PyTorch and scikit-learn for building the machine learning modules.

Datasets and evaluation

The temporal network datasets used in the experiments are listed below. The datasets are collected from the Koblenz Network Collection [65].

IA-Contacts-hypertext 2009 (hypertext)  A temporal network which represents the face-to-face proximity between people during the ACM Hypertext 2009 conference. Nodes represent the attendees, and the time-stamped edges represent the interactions between people over a period of 2.5 days. It contains 113 nodes and 20.8k time-stamped edges.

IA-Enron-Employees (enron)  An email communication network between employees of Enron Inc. It contains 151 nodes and 50.5k time-stamped edges over a period of 1137 days.

FB Forum (FB)  This is the data collected from an online student community where the nodes represent the students and the time-stamped edges represent the messages posted between them at a particular time-step. It contains 899 nodes and 33.7k time-stamped edges over a period of 164 days.

IA-Radoslaw-Email (radoslaw)  It represents an email communication network of a manufacturing company where nodes represent employees and edges between them are email communications. The graph consists of 167 vertices and 82.9k edges over a period of 271 days.

Statistics of the various datasets used are shown in Table 2.

The evaluation measures used to compare the performance of the proposed system with the baselines are:

AUC  AUC is a widely used evaluation metric for link prediction. It can be interpreted as the probability that a randomly chosen missing link is given a higher score than a randomly chosen non-existent link, given the ranks of all the non-observed links. Among n independent comparisons, if the missing link has a higher score \(n'\) times and the two have the same score \(n''\) times, the AUC is calculated as:

$$\begin{aligned} AUC = \frac{n' + 0.5 n''}{n}. \end{aligned}$$
(16)

Average precision (AP)  It estimates the precision of every prediction and computes the average over all precisions. It is calculated as:

$$\begin{aligned} \mathrm{AP} = \sum _{n} (R_n -R_{n-1})P_n, \end{aligned}$$
(17)

where \(P_n\) and \(R_n\) are the precision and recall at the \(n^{th}\) threshold.
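Both measures can be computed with scikit-learn, which is used in the experimental setup; the label and score arrays below are placeholders, not experimental results.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# 1 = existing (positive) link, 0 = non-existent (negative) link; scores from the decoder.
y_true  = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.6, 0.3, 0.8]

print("AUC:", roc_auc_score(y_true, y_score))
print("AP :", average_precision_score(y_true, y_score))
```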

Baseline methods

A quick introduction to specific baselines is listed below:

Node2vec [28]  Node2vec performs random walks on a static network to generate node sequences and uses skip-gram with negative sampling to generate node representations. Node2vec performs a biased random walk which provides more flexibility in exploring node neighbourhoods. The learned representations can be used for link prediction using vector-based similarity measures.

Graph convolutional network (GCN) [13]  GCN is a variant of the graph neural network which applies a graph convolution at each node to perform the propagation and aggregation of node features from neighbouring nodes. As GCN is defined only for static networks, we consider the graph to be static when conducting experiments.

Graph attention network (GAT) [20]  GAT is an enhancement of GCN. In addition to convolution and feature aggregation, GAT uses an attention layer to learn self-attention weights that indicate the importance of node j's features to node i. As GAT is also defined only for static networks, we consider the graph to be static when conducting experiments.

Continuous-time dynamic network embeddings (CTDNE) [19]  This work first performs truncated time-respecting random walks over the temporal networks to generate temporal path sequences. Furthermore, a skip-gram objective is trained to generate node embeddings. The learned representations are used in predicting missing links.

Result and analysis

To evaluate the quality of the embeddings, we perform link prediction using the TempGAN autoencoder (TempGAN-AE) architecture, which is learned by end-to-end training. To test link prediction performance with TempGAN-AE, we hide 10–15% of the temporal links of the original network, generate node embeddings using the TempGAN encoder, and reconstruct the original network using the inner product decoder. The whole network is trained using stochastic gradient descent (SGD). The experiment with GCN is conducted using the same procedure without considering the temporal information of the links. To test link prediction performance using Node2vec, the procedure is as follows: hide 10–15% of the links to form the training set, generate node embeddings from the training set, use the Hadamard product of the node embeddings to form the edge embeddings, and build a classifier based on positive and negative edges; the hidden edges are used to test the accuracy of the classifier, as sketched below. To test link prediction performance with CTDNE, we hide 10–15% of the temporal links from time 1 to ‘t-1’, generate embeddings, and predict the links at time ‘t’. While training the classifier, existing edges are considered positive samples and disconnected edges are considered negative samples. Now, we present the analysis and comparison of the results obtained from conducting link prediction experiments on four real-world networks.
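For the baseline protocol described above, the following is a sketch of forming edge features as the Hadamard product of the endpoint embeddings and training an SVM on positive/negative edges; the random embeddings and edge lists are placeholders standing in for the learned node vectors.

```python
import numpy as np
from sklearn.svm import SVC

def edge_features(emb, edges):
    """Hadamard (element-wise) product of the two endpoint embeddings per edge."""
    return np.array([emb[u] * emb[v] for u, v in edges])

# Placeholder embeddings (node -> vector) and train/test edge lists with labels.
rng = np.random.default_rng(0)
emb = {n: rng.normal(size=16) for n in range(10)}
train_edges, train_y = [(0, 1), (2, 3), (4, 5), (6, 7)], [1, 1, 0, 0]
test_edges,  test_y  = [(1, 2), (8, 9)], [1, 0]

clf = SVC().fit(edge_features(emb, train_edges), train_y)
print(clf.decision_function(edge_features(emb, test_edges)))  # higher = more likely link
```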

Table 2 Statistics of various datasets used

Performance

First, we discuss the parameter settings that gave the proposed system optimum performance. We set the embedding dimension d = [128, 256, 128, 128] for the hypertext, FB, enron, and radoslaw datasets, respectively. We use two hidden layers in TempGAN with sizes hypertext = [256, 128], FB = [512, 256], enron = [256, 128], and radoslaw = [256, 128]. The temporal random walk length is set as l = [6, 8, 6, 4] for the hypertext, FB, enron, and radoslaw datasets, respectively. Other hyperparameters include dropout = [0.5, 0.4, 0.5, 0.4] and epochs = [600, 750, 500, 600] for the hypertext, FB, enron, and radoslaw datasets, respectively, with an initial learning rate of 0.005 and a leaky ReLU alpha of 0.1. The neuron activations use the ReLU function. Training is done with the gradients zeroed at each step and binary cross-entropy as the loss function. For GCN, we use the same parameter settings as the proposed system. For node2vec and CTDNE, we set walk length = 40, negative samples = 5, and context window size = 10. An SVM classifier is used to predict the positive and negative links.

Fig. 8 AUC comparison of proposed system with baselines

Table 3 AP comparison of proposed system with baselines
Table 4 Effect of attention mechanism in the proposed system

Now, we compare the performance of TempGAN with three static network embedding methods (Node2vec, GCN, and GAT) and one temporal network embedding method (CTDNE). The performance improvement of TempGAN over the baseline methods for link prediction is shown in Fig. 8 and Table 3. Figure 8 depicts the AUC comparison: for the hypertext dataset, the proposed system gains performance improvements of 18.7%, 15.1%, 15.1%, and 13.4%, and for the enron dataset, improvements of 15.0%, 10.5%, 6.3%, and 7.6% over node2vec, GCN, GAT, and CTDNE, respectively. Similarly, AUC improvements of 11.8%, 6.2%, 3.6%, and 4.9% are obtained against the baselines for the radoslaw dataset. For the FB dataset, the proposed system gains improvements of 2.5%, 6.5%, and 2.5% over node2vec, GCN, and GAT, respectively, and achieves comparable performance to CTDNE. We observe that the graph convolution-based methods have an advantage for networks with a higher average node degree (dense networks), whereas a truncated random walk-based strategy like CTDNE is more useful for sparse networks like FB. Table 3 provides the average precision comparison of the proposed system with the baselines, which also shows the advantage of the proposed system over state-of-the-art network embedding methods. Furthermore, we show the effect of the attention mechanism used in the proposed system by conducting experiments with and without attention, and the results are shown in Table 4. The results show that the learned attention weights can further improve the quality of the node embeddings and thereby the AUC score of the link prediction task.

Parameter analysis

In this section, we analyse the effect of various parameter settings used in the experiments, which include the length of the temporal walk, the number of hidden layers in the TempGAN architecture, and the dimensionality of the node embeddings used in the TempGAN autoencoder. We also show the variation in reconstruction loss and AUC across epochs of network training, along with the training time required to complete one epoch.

Length of the temporal walk

This parameter decides the number of hops that a valid temporal walk can cover and is therefore an important parameter which influences the extent to which temporal information is considered for network embedding. A very low value for the walk length l only allows feature aggregation from each node's very close temporal neighbours, whereas larger values of l allow features to be considered from more distant temporal neighbours. The effect of l on the various datasets is shown in Fig. 9. For dense networks like enron and radoslaw, the optimum AUC score is obtained when considering features from four-hop temporal neighbours. For the FB dataset, a walk length covering eight hops provided optimum performance. Setting very high values for l introduces noise which may reduce system performance. We conclude that the optimum value of l for each dataset depends upon the connectivity patterns of the network.

Fig. 9 Effect of walk length

Parameters of the neural network

The effect of the number of hidden layers in the TempGAN architecture is shown in Table 5. We obtained the optimum AUC values when conducting experiments with a two-hidden-layer TempGAN. Increasing the hidden convolution and attention layers beyond two does not provide any improvement in the results. The embedding dimension is a parameter which can be tuned according to the number of nodes in the input graph. The effect of the dimensionality d on AUC for the various datasets is shown in Fig. 10. For hypertext, enron, and radoslaw, the optimum performance is obtained when d = 128, and for FB, d = 256 provided the best AUC values.

Table 5 Effect of TempGAN hidden layers on AUC
Fig. 10 Effect of dimensionality on AUC

The reduction in reconstruction loss with the increase in the number of epochs is shown in Fig. 11, and the improvement in the AUC score during learning is shown in Fig. 12. The training time required to complete one epoch for each dataset is shown in Table 6.

Fig. 11 Effect of number of epochs on reconstruction loss

Fig. 12 Effect of number of epochs on AUC

Conclusion

Embedding the nodes of a network in a vector space while preserving its structural properties is a challenging research problem. Among the various network embedding methods, graph convolution-based approaches gained more popularity because of their simplicity and effectiveness. In this work, we address the problem of temporal network embedding, which aims to map the nodes of a network to a vector space while preserving the temporal information. We extend the concepts of graph convolution and attention to temporal network data so as to generate time-aware node embeddings. We propose a neural architecture which uses both the link and temporal information of the network to generate node embeddings that can be used in many network mining tasks which require end-to-end training. We design a graph autoencoder based on the proposed architecture which performs link prediction on temporal networks. We conducted experiments on real-world temporal networks and compared the results with state-of-the-art methods.

Table 6 Average training time per epoch (s)

In the future, we aim to extend temporal network embedding to more complex systems like epidemic networks which use SIS (susceptible-infected-susceptible) and SIR (susceptible-infected-recovered) modeling. The proposed approach can also be applied to other network mining tasks like node classification and anomaly detection. Extending the work to more complex settings like heterogeneous and signed networks is another interesting direction for future work.