1 Introduction

In recent years, predicting the future formation of links in a network has become a very active research topic, and the task known as Link Prediction has emerged as an important research area in social network analysis. Several approaches have been proposed to tackle this problem; they can be roughly grouped into content-based, topology-based, and learning-based approaches. However, most of these approaches predict links without taking into account the fact that there are different kinds of nodes in a social network, for example, information sources, information seekers, idea starters, commentators, viewers, and influential users, among others.

Influential nodes represent users that exert social influence on other users of the social network. Social influence occurs when one’s actions are affected by others. For example, user A exerts influence on user B when B watches a movie because A recommended it previously. Thus, influential users can be seen as effective recommendation sources. Many applications exploit the concept of social influence. In the field of data mining, some applications include viral marketing (Wortman 2008; Monteserin and Armentano 2018), recommender systems (Ye et al. 2012), analysis of information diffusion in Facebook and Twitter (Bakshy et al. 2011), expert finding (Liu et al. 2013), decision support systems (Monteserin and Amandi 2015), analysis of scientist collaboration (Jiang et al. 2017) and ranking of feeds (Ienco et al. 2010), among others.

In this work, we propose an approach for influential link prediction (ILP): given a target user, our approach predicts links to users that could exert social influence on her/him. To do this, an influence maximization algorithm is used to determine a set of possible influential users from the set of current influential users of the target. This kind of algorithm tries to solve a key problem in the area of social network analysis: the influence maximization problem, which involves finding a set of users in a social network such that, by targeting this set, the expected spread of influence in the network is maximized (Goyal et al. 2011; Kempe et al. 2003). In particular, we apply a data-based approach to social influence maximization, named the Credit Distribution (CD) model (Goyal et al. 2011), which learns how influence flows in a network by directly leveraging available propagation traces. In this context, we claim that the nodes that also influence the nodes influenced by the set of current influential users of the target are potential new influential nodes that can be suggested as new connections to the target user. Thus, ILP searches for these nodes by using an adapted version of the CD model.

We claim that our approach is particularly useful for link prediction in scenarios with low homophily, where content-based approaches fail to capture similarities between nodes. Homophily is the tendency of individuals to associate and bond with similar others (Bonchi 2011). Homophily is usually taken into account by several Link Prediction approaches (for example, content-based algorithms; see Sect. 2.1). However, social influence is not the same as homophily (Aral et al. 2009; La Fond and Neville 2010). If social influence effects are present in a social network, nodes are likely to change their attributes to conform to their neighbors' values. In contrast, if homophily effects are present in the network, individuals (nodes) are likely to link to other individuals with similar attribute values (Aral et al. 2009).

To validate our approach, we carried out a set of experiments in the movies domain (Flixster) and the microblogging domain (Twitter). We compared the precision, recall, nDCG and AUC of ILP with those of the main topological metrics studied in the literature for link prediction (common neighbors, Jaccard, Sørensen, and Adamic–Adar, among others; Wang et al. 2015) and with those of a learning-based approach. This comparison showed that our approach performed better than the existing approaches. Moreover, we compared the performance of the topological metrics in two different configurations, set up by varying the set of neighbors observed: only influential neighbors and all neighbors. This comparison showed that the best predictions were obtained when the topological metrics observe only the set of influential neighbors.

The article is organized as follows. Section 2 introduces some concepts related to link prediction and social influence maximization. Section 3 presents the approach to predict links to influential users. Section 4 shows the results obtained from the experiments. Finally, Sect. 5 presents our conclusions and future work.

2 Background

In this section, we describe two relevant fields, namely, Link Prediction and Social Influence Maximization. First, we define the problem of link prediction and review some of the main approaches to it. Then, we introduce the main concepts of social influence maximization, propagation models and algorithms.

2.1 Link prediction

Link prediction for social networks was formalized by Liben-Nowell and Kleinberg as the problem of predicting the edges that will be added to a given snapshot of a social network during the interval from time t to a future time \(t'\) (Liben-Nowell and Kleinberg 2003). Formally, given a snapshot of a network at time t, \(G_{t}(V,E)\), where V is the set of nodes and E is the set of links, we seek to find the set of edges \(E'\), from all the \((|V|\cdot (|V|-1))-|E|\) possible links among nodes in V, that will appear in the network at time \(t'\), \(G_{t'}(V,E')\). It is worth noticing that the edges in the network can represent both connections and interactions between nodes.
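To make the formalization concrete, the following minimal Python sketch enumerates the \((|V|\cdot (|V|-1))-|E|\) candidate links absent from \(G_{t}\) and keeps the top-N scored pairs. The scoring function is injected as a parameter and stands in for any of the approaches discussed below; the function names are illustrative, not taken from an existing library.

```python
from typing import Callable, Dict, List, Set, Tuple

def predict_links(adj: Dict[str, Set[str]],
                  score: Callable[[str, str], float],
                  top_n: int = 10) -> List[Tuple[str, str, float]]:
    """Rank the directed node pairs that are not yet linked in G_t.

    adj maps each node to the set of nodes it currently links to;
    score(x, y) is any similarity/likelihood estimator (content-based,
    topological, or learned).
    """
    nodes = set(adj)
    candidates = [
        (x, y) for x in nodes for y in nodes
        if x != y and y not in adj[x]        # the (|V|*(|V|-1)) - |E| non-edges
    ]
    ranked = sorted(((x, y, score(x, y)) for x, y in candidates),
                    key=lambda t: t[2], reverse=True)
    return ranked[:top_n]

# Toy usage: score a pair by its number of common out-neighbors.
if __name__ == "__main__":
    g = {"a": {"b"}, "b": {"c"}, "c": {"b"}, "d": {"b", "c"}}
    cn = lambda x, y: len(g[x] & g[y])
    print(predict_links(g, cn, top_n=3))
```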

Link prediction methods can be roughly grouped into three categories: content-based, topology-based, and learning-based.

Content-based algorithms assign to each pair of nodes x and y a similarity score sim(x, y) that is computed using the attributes of the nodes, such as user profiles (Bhattacharyya et al. 2010), user-generated content (Armentano et al. 2013), document information (Perlich et al. 2009), user interests (Anderson et al. 2012), etc. The similarity score for all pairs of non-connected nodes is computed, and the edges with the top N scores or, alternatively, with a score over a certain threshold are predicted. These methods work under the assumption that users tend to relate to people who are similar to them in a certain way. In other words, content-based algorithms assume that users in the network follow the principle of homophily.

Topology-based methods can be applied to any network, even if there is no information available about the nodes. These methods compute different metrics between pairs of nodes that are then used to rank the possible connections to be predicted, similarly to content-based methods. The metrics used by these methods can be divided into local (or neighbor-based) metrics, path-based metrics and random walk metrics. Liben-Nowell and Kleinberg (2003) presented different methods for link prediction based on node neighborhoods and on the ensemble of all paths. The simplest neighbor-based metric considers the number of common neighbors between two nodes (CN). Many metrics are based on CN and intend to normalize it according to different criteria. For example, the Jaccard Coefficient uses the total number of neighbors of the two nodes; the Sørensen–Dice Index considers that nodes with lower degrees would have a higher link likelihood; Hub Promoted considers that the topological overlap is determined by the lower-degree node, while Hub Depressed determines the value by the higher-degree node. Other approaches combine some of these topological metrics and their weighted versions (Armentano et al. 2012; Güneş et al. 2016). Among the path-based metrics, Local Path (Lü et al. 2009), the Katz metric (Katz 1953) and Relation Strength Similarity (Chen et al. 2012) are commonly used for link prediction. Finally, random walk metrics use transition probabilities from a node to its neighbors to denote the destination of a random walker that departs from the current node. The random walk information can be used to measure the distance between any pair of nodes. Nodes are then sorted by shortest distance to select which edges to predict. Classical algorithms in this category are Hitting Time (Fouss et al. 2007), PageRank (Page et al. 1999) and its variants.

Finally, learning-based methods approach link prediction as a binary classification problem. Each pair of nodes x and y is considered an instance that is described by a set of features (usually built from the metrics described previously) and a class label (\(+\) if there exists an edge connecting x and y or − otherwise). Any classifier can then be used to predict the class for non-existent edges, such as decision trees (Scellato et al. 2011), naïve Bayes (Scellato et al. 2011), support vector machines (Li and Chen 2013), logistic regression (Chiang et al. 2011), frequent graph pattern mining (Pobiedina and Ichise 2016), and matrix alignment (Scripps et al. 2008). The main problem that has to be addressed when considering link prediction as a classification task is that the classes are inherently unbalanced for most networks, since the links that will actually appear represent a very small subset of the possible links that can be established between any pair of nodes in the network.

Recently, research in link prediction has also focused on dynamic networks (Rahman and Hasan 2016; Choudhury and Uddin 2017, 2018). This line of research considers that the behavior and characteristics of the nodes and the links among them change over time. For these kinds of networks, a new set of metrics needs to be defined in order to measure the similarity between each pair of actors.

In this article, we focus on the topological structure of the network to locally find a set of candidate influential nodes for the target node. User actions on the network are used to determine the influence exerted on other users.

2.2 Social influence

Social influence occurs when a person's actions are affected by others. This effect can be seen in conformity, socialization, peer pressure, obedience, leadership, persuasion, sales, and marketing (Goyal 2013). Social influence is defined as the change in an individual's thoughts, feelings, attitudes, or behaviors that results from the interaction with another individual or a group (Rashotte 2006). Many applications exploit social influence, and the propagation of influence that users of a social network exert on other users has been widely studied in recent years.

A key problem in this area is the identification of influential users (Goyal et al. 2011). Kempe et al. (2003) formalized this problem as the influence maximization problem: given a directed graph \(G=(V,E,p)\), where nodes are users and edges are labeled with influence probabilities among users, the influence maximization problem looks for a set of seeds (users) that maximizes the expected spread of influence in the social network under a given propagation model. A propagation model indicates how influence propagates through the network. Two propagation models were proposed by Kempe et al.: the Independent Cascade (IC) and the Linear Threshold (LT) models. In both models, each node can be either active or inactive at a given moment. Moreover, the tendency of each node to become active increases monotonically as more of its neighbors become active.
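To illustrate how a propagation model is evaluated in practice, the following sketch simulates one stochastic run of the IC model under its usual formulation (each newly activated node gets a single chance to activate each inactive neighbor with the probability attached to the edge). It is a minimal illustration written for this article, not the implementation used in the cited works.

```python
import random
from typing import Dict, Set, Tuple

def independent_cascade(prob: Dict[Tuple[str, str], float],
                        seeds: Set[str],
                        rng: random.Random) -> Set[str]:
    """One stochastic run of the IC model.

    prob[(u, v)] is the probability that an active u activates v.
    Returns the set of nodes active when the cascade stops.
    """
    # Build out-neighbor lists from the weighted edge set.
    out = {}
    for (u, v), p in prob.items():
        out.setdefault(u, []).append((v, p))

    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        new_frontier = []
        for u in frontier:
            for v, p in out.get(u, []):
                # Each newly active node gets exactly one activation attempt per neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    new_frontier.append(v)
        frontier = new_frontier
    return active

# Toy usage: estimate the expected spread of seed set {'a'} over 1000 runs.
if __name__ == "__main__":
    p = {("a", "b"): 0.5, ("b", "c"): 0.3, ("a", "c"): 0.1}
    rng = random.Random(42)
    runs = [len(independent_cascade(p, {"a"}, rng)) for _ in range(1000)]
    print(sum(runs) / len(runs))
```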

Given a propagation model m (for example, IC or LT) and an initial seed set \(S\subseteq V\), the expected number of active nodes at the end of the process is the expected (influence) spread, denoted by \(\sigma _{m}(S)\) (Goyal et al. 2011). Then, the influence maximization problem is defined as follows: given a directed and edge-weighted social graph \(G=(V,E,p)\) (where nodes are users and edges are labeled with influence probabilities among users), a propagation model m, and a number \(k\le \left| V\right| \), find a set \(S\subseteq V\), \(\left| S\right| =k\), such that \(\sigma _{m}(S)\) is maximum. Several approaches have been developed to solve this problem. Although the problem is NP-hard under both the IC and LT propagation models, some characteristics of the function \(\sigma _{m}(S)\) (monotonicity and submodularity; see Kempe et al. 2003 for further details) made it possible to develop a greedy algorithm that approximates the optimal solution.
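The greedy algorithm exploits exactly these properties: at each step it adds the node with the largest marginal gain in spread. The sketch below shows that skeleton with the spread estimator injected as a parameter (for example, an average over several runs of the IC simulation above); the CELF optimization used in practice is omitted for clarity, and the toy estimator in the usage example is only a crude stand-in.

```python
from typing import Callable, Iterable, List, Set

def greedy_seed_selection(nodes: Iterable[str],
                          k: int,
                          spread: Callable[[Set[str]], float]) -> List[str]:
    """Plain greedy maximization of a monotone submodular spread function.

    spread(S) should return an estimate of sigma_m(S), e.g. a Monte Carlo
    average over Independent Cascade runs. CELF-style lazy evaluation,
    used in practice for efficiency, is omitted here for clarity.
    """
    seeds: List[str] = []
    current = 0.0
    candidates = set(nodes)
    for _ in range(k):
        best_node, best_gain = None, 0.0
        for v in candidates:
            gain = spread(set(seeds) | {v}) - current   # marginal gain of v
            if gain > best_gain:
                best_node, best_gain = v, gain
        if best_node is None:        # no remaining node adds any spread
            break
        seeds.append(best_node)
        candidates.remove(best_node)
        current += best_gain
    return seeds

# Toy usage with a crude stand-in estimator: spread = number of seeds
# plus the size of their combined out-neighborhood.
if __name__ == "__main__":
    out = {"a": {"b", "c"}, "b": {"c"}, "c": set(), "d": {"e"}, "e": set()}
    sigma = lambda S: len(S | set().union(*(out[s] for s in S))) if S else 0.0
    print(greedy_seed_selection(out.keys(), 2, sigma))
```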

One of the limitations of the IC and LT propagation models is that the edge-weighted social graph is assumed as input to the problem, without addressing the question of how the probabilities are obtained (Goyal et al. 2010). For this reason, Goyal et al. (2011) proposed the Credit Distribution (CD) model, which directly estimates influence spread by exploiting historical data. In this context, the influence maximization problem to be solved under the CD model is reformulated as follows: given a directed social graph \(G=(V,E)\), an action log \({\mathbb {L}}\), and an integer \(k\le \left| V\right| \), find a set \(S\subseteq V\), \(\left| S\right| =k\), such that \(\sigma _{cd}(S)\) is maximum. Under the CD model, \(\sigma _{cd}(S)\) is defined as \(\sigma _{cd}(S)=\sum _{u\in V}\kappa _{S,u}\), where \(\kappa _{S,u}\) represents the total credit given to S for influencing u over all actions. To solve this problem, Goyal et al. developed an algorithm for influence maximization under the CD model. This algorithm initially scans the action log \({\mathbb {L}}\) to learn the influence probabilities in the social network, computing the influenceability scores of the users. An action log is a set of triples \((u,\,a,\,t)\) which state that user u performed action a at time t. Then, the seed set is selected under the CD model by using a greedy algorithm with the CELF optimization (Goyal et al. 2011). It is worth noticing that if the timestamp at which an edge was created is available, the algorithm considers this information to find the seed set. This allows our approach to work with both static and dynamic networks. See Goyal et al. (2011) for further details on the algorithm implementation.
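To give an intuition of how influence can be learned from an action log, the following sketch computes a heavily simplified, direct-credit-only variant of the CD model: for every action a user performs, one unit of credit is split equally among the neighbors who performed the same action earlier, and the credit is normalized by the number of actions the influenced user performed. Time decay and the propagation of credit along longer paths, both present in the full CD model of Goyal et al. (2011), are deliberately omitted.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Action = Tuple[str, str, float]   # (user, action_id, timestamp)

def direct_credit(action_log: List[Action],
                  influencers_of: Dict[str, Set[str]]) -> Dict[str, Dict[str, float]]:
    """Return credit[v][u]: simplified credit given by u to v for influencing u.

    influencers_of[u] is the set of nodes u is exposed to (e.g. the users
    u follows). This is a direct-credit sketch, not the full path-based
    formulation of the CD model.
    """
    # Earliest time each user performed each action.
    performers: Dict[str, Dict[str, float]] = defaultdict(dict)
    for user, action, t in action_log:
        performers[action][user] = min(t, performers[action].get(user, t))

    credit: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    actions_by = defaultdict(int)
    for user, action, t in action_log:
        actions_by[user] += 1
        earlier = [v for v in influencers_of.get(user, set())
                   if v in performers[action] and performers[action][v] < t]
        for v in earlier:                      # split one unit of credit equally
            credit[v][user] += 1.0 / len(earlier)

    # Normalize by the number of actions each influenced user performed.
    for v in credit:
        for u in credit[v]:
            credit[v][u] /= actions_by[u]
    return credit

# Toy usage: user 1 follows users 3 and 4 and repeats their actions later.
if __name__ == "__main__":
    log = [("3", "a1", 1.0), ("1", "a1", 2.0), ("4", "a2", 1.0), ("1", "a2", 3.0)]
    follows = {"1": {"3", "4"}}
    print({v: dict(d) for v, d in direct_credit(log, follows).items()})
```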

3 Link prediction of influential nodes

In this section, we present ILP, our approach to predict links to influential nodes. ILP is based on the social influence exerted among users in a social network. In this context, we claim that it is possible to predict links to influential nodes by observing which users also influence the set of users influenced by the current influential users of the target. Here, the target is the user to whom the approach will recommend influential users (new links).

Figure 1 shows the four steps of the proposed approach. First, our approach searches for nodes that influence the target user (Fig. 1, Step 1). To do this, we reformulate the influence maximization problem under the CD model by adding the parameter T to the problem definition. Then, the influence maximization problem is defined as follows: given a directed social graph \(G=(V,E)\), a subset \(T\subseteq V\), an action log \({\mathbb {L}}\), and an integer \(k\le \left| V\right| \), find a set \(S\subseteq V\), \(\left| S\right| =k\), such that \(\sigma _{cd}^{T}(S)=\sum _{t\in T}\kappa _{S,t}\) is maximum. In other words, we modify the problem definition to obtain a seed set by taking into account only the influence exerted on the nodes \(t\in T\). Notice that when \(T=V\) the problem becomes the traditional one. Thus, the first step of ILP is carried out by running the CD model with \(T=\{{\textit{Target}}\}\). The result of this step is the set \(I_{\textit{Target}}\), composed of the nodes that influence the target. Moreover, it is worth noticing that the greedy algorithm always returns a seed set S with k elements. However, it is possible that the last elements added to the seed set do not actually exert influence on T. This occurs whenever T is influenced by only l nodes with \(l<k\). For this reason, we include a threshold mininf and only keep in \(I_{\textit{Target}}\) the nodes whose marginal gain exceeds mininf, where the marginal gain of a node w is computed as \(\sigma _{m}(S\,\cup \,\{w\})-\sigma _{m}(S)\) (Goyal et al. 2011).
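A minimal sketch of this step is shown below. It assumes, for simplicity, that the credit a seed set receives for influencing a node combines the individual credits in a noisy-OR fashion (the exact combination used by the CD model differs; see Goyal et al. 2011), and the scale of the mininf threshold depends on how credits are normalized in the underlying implementation.

```python
from typing import Dict, List, Set

def set_credit(credit: Dict[str, Dict[str, float]],
               seeds: Set[str], u: str) -> float:
    """Credit given to the seed set for influencing u.

    Simplifying assumption: individual credits are combined noisy-OR style,
    i.e. 1 - prod(1 - kappa_{v,u}); the full CD model combines them differently.
    """
    prod = 1.0
    for v in seeds:
        prod *= 1.0 - credit.get(v, {}).get(u, 0.0)
    return 1.0 - prod

def select_influencers(credit: Dict[str, Dict[str, float]],
                       targets: Set[str], k: int, mininf: float) -> List[str]:
    """Greedy seed selection restricted to the influence exerted on `targets`.

    Mirrors Step 1 of ILP: sigma^T(S) = sum over t in targets of kappa_{S,t},
    and only seeds whose marginal gain exceeds mininf are kept.
    """
    sigma = lambda S: sum(set_credit(credit, S, t) for t in targets)
    seeds: List[str] = []
    current = 0.0
    candidates = set(credit)           # nodes that received credit, i.e. potential influencers
    while len(seeds) < k and candidates:
        best, best_gain = None, mininf  # the gain must exceed the threshold
        for v in candidates:
            gain = sigma(set(seeds) | {v}) - current
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            break
        seeds.append(best)
        candidates.remove(best)
        current += best_gain
    return seeds
```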

Second, ILP searches for nodes influenced by the set \(I_{\textit{Target}}\) (Fig. 1, Step 2) and stores these nodes in the set \({\textit{NI}}^{I_{\textit{Target}}}\). Although this step is not defined as an influence maximization problem, ILP uses the concept of credit distribution to search for these nodes. Thus, we define \({\textit{NI}}^{I_{\textit{Target}}}=\{ni\in V\mid ni\ne Target\,\wedge \,\sum _{i\in I_{\textit{Target}}}\kappa _{i,ni}>0\}\). In other words, \({\textit{NI}}^{I_{\textit{Target}}}\) is composed of the nodes that give credit to at least one node included in \(I_{\textit{Target}}\) (excluding the target).

Fig. 1 Steps of the ILP approach

The third step consists of searching for a new set of nodes (\({\textit{New}}_{{\textit{NI}}}\)) that also influence \({\textit{NI}}^{I_{\textit{Target}}}\) (Fig. 1, Step 3). To do this, we use the same influence maximization problem definition explained in Step 1 with \(T={\textit{NI}}^{I_{\textit{Target}}}\). In contrast with Step 1, \({\textit{New}}_{{\textit{NI}}}\) must be filtered, since nodes of \(I_{\textit{Target}}\) might also have been included in the seed set. Finally, in Step 4, ILP uses the set \({\textit{New}}_{{\textit{NI}}}\) to recommend new influential links to the target (Fig. 1, Step 4).
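Steps 2 to 4 can be expressed compactly on top of the Step 1 routine. The sketch below assumes a credit table such as the one built from the action log and a select_influencers callable in the style of the Step 1 sketch; both names are illustrative and not part of any published implementation.

```python
from typing import Callable, Dict, List, Set

def ilp_recommend(target: str,
                  credit: Dict[str, Dict[str, float]],
                  select_influencers: Callable[[Set[str]], List[str]]) -> List[str]:
    """Steps 1-4 of ILP, given a Step-1 style seed selector.

    select_influencers(T) is expected to return the seed set that maximizes
    the credit-based spread restricted to T, already filtered by mininf.
    """
    # Step 1: nodes that influence the target.
    i_target = set(select_influencers({target}))

    # Step 2: other nodes influenced by I_Target (they give credit to it).
    ni = {u for v in i_target for u in credit.get(v, {})
          if u != target and credit[v][u] > 0}

    # Step 3: nodes that also influence NI, excluding already known influencers.
    new_ni = [v for v in select_influencers(ni) if v not in i_target]

    # Step 4: recommend the remaining nodes as new influential links.
    return new_ni
```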

Fig. 2 Example of social and derived influence graph with node 1 as target

To illustrate our proposal, Fig. 2a shows an example of a simple directed social graph. This graph is composed of 9 nodes and 13 directed edges. The direction of the edge between a node A and a node B indicates that A follows B. Notice that influence flows contrary to the direction indicated by the edges. Table 1 shows the action log used to learn the influence probabilities. This log has 3 columns: the id of the node that performs the action, the id of the action and the time when the action was performed. Following the steps presented above, with \({\textit{Target}}=1\), the first step results in the set \(I_{\textit{Target}}=\{3,4\}\). Then, the second step of our approach searches for other nodes that are also influenced by nodes 3 and 4. Consequently, \({\textit{NI}}^{I_{\textit{Target}}}=\{5,6\}\). Next, ILP looks for nodes that exert influence on nodes 5 and 6. Thus, the third step returns nodes 9, 3 and 4 (\({\textit{New}}_{\textit{NI}}=\{9,3,4\}\)). However, since nodes 3 and 4 are included in \(I_{\textit{Target}}\), the final \({\textit{New}}_{{\textit{NI}}}\) is \(\{9\}\). Finally, node 9 becomes a potential link to be recommended to node 1. Figure 2b shows the derived influence graph when the target is node 1. This graph shows the influence relationships among the nodes from the point of view of the target.

Table 1 Example of action log

4 Experimental evaluation

4.1 Experimental settings

To evaluate our approach, we ran experiments comparing the performance of ILP with that of different well-known topological metrics for link prediction. We experimented on two well-known real-world datasets extracted from Flixster (Jamali and Ester 2010) and Twitter (De Domenico et al. 2013).

The Flixster dataset is composed of 786,936 nodes; 7,058,819 directed edges; and 8,196,077 logged actions. Since Flixster is one of the main players in the social movie rating business, each action represents a user rating a movie. Thus, if user v rates “Frozen”, and later v’s friend u does the same, we consider that the action of rating “Frozen” propagated from v to u (Goyal et al. 2011). The Flixster dataset was chosen because it is a real-world dataset in which the similarity between linked users is low. To check this fact, we computed the average Pearson similarity (\(\frac{\sum _{\forall u\rightarrow v}simPearson(u,v)}{|u\rightarrow v|}=0.008\)) and the average GroupLens similarity (\(\frac{\sum _{\forall u\rightarrow v}simGroupLens(u,v)}{|u\rightarrow v|}=0.001\)) using Eqs. 1 and 2, respectively.

$$\begin{aligned} simPearson(u,v)=\frac{\sum _{i\in I_{u}\cap I_{v}}(r_{ui}-\mu _{u})(r_{vi}-\mu _{v})}{\sqrt{\sum _{i\in I_{u}\cap I_{v}}(r_{ui}-\mu _{u})^{2}}\sqrt{\sum _{i\in I_{u}\cap I_{v}}(r_{vi}-\mu _{v})^{2}}} \end{aligned}$$
(1)
$$\begin{aligned} simGroupLens(u,v)=\frac{\sum _{i\in I_{u}\cup I_{v}}(r_{ui}-\mu _{u})(r_{vi}-\mu _{v})}{\sqrt{\sum _{i\in I_{u}\cup I_{v}}(r_{ui}-\mu _{u})^{2}}\sqrt{\sum _{i\in I_{u}\cup I_{v}}(r_{vi}-\mu _{v})^{2}}} \end{aligned}$$
(2)

Both similarity metrics measure the difference between the ratings given by user u and user v to a given item i. Since different users may use different scales when rating items, the equations consider the difference between the user's actual rating of the item and the user's average rating over all items, \(\mu _{u}\). The equations are then normalized so that the metrics take values between \(-\,1\) (users rating items in an opposite manner) and 1 (users rating items in the same manner). Both metrics are 0 for totally dissimilar users. The main difference between simPearson(u,v) and simGroupLens(u,v) is the set of items used to measure the similarity. The Pearson correlation considers the items that were rated by both users (\(I_{u}\cap I_{v}\)). As a consequence, users with few rated items in common can have high similarities. If two users have rated many items and have only one item in common with the same rating, the metric will be 1, indicating that the users are similar even though this might not be the case. For this reason, simGroupLens considers all the items rated by either user (\(I_{u}\cup I_{v}\)), with the normalized rating \(r_{ui}-\mu _{u}=0\) whenever u has not rated i. With this modification, if both users have rated exactly the same items, the metric reduces to the Pearson correlation. However, if one user has rated items that the other has not, those ratings drop out of the numerator (since they are multiplied by 0) but still contribute to the denominator.

It is important to consider this fact because low similarity indicates low homophily, which makes the application of content-based algorithms unfavorable, as explained in Sect. 2.1.
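For reference, the sketch below computes both similarities from per-user rating dictionaries following Eqs. 1 and 2; the data layout is assumed only for illustration.

```python
from math import sqrt
from typing import Dict

Ratings = Dict[str, float]   # item id -> rating

def sim_pearson(ru: Ratings, rv: Ratings) -> float:
    """Eq. 1: correlation over the items rated by both users."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    mu_u = sum(ru.values()) / len(ru)
    mu_v = sum(rv.values()) / len(rv)
    num = sum((ru[i] - mu_u) * (rv[i] - mu_v) for i in common)
    den = sqrt(sum((ru[i] - mu_u) ** 2 for i in common)) * \
          sqrt(sum((rv[i] - mu_v) ** 2 for i in common))
    return num / den if den else 0.0

def sim_grouplens(ru: Ratings, rv: Ratings) -> float:
    """Eq. 2: same form, but over the union of rated items.

    Items not rated by a user contribute a normalized rating of 0, so they
    drop out of the numerator but the other user's rating still weighs the
    denominator.
    """
    union = set(ru) | set(rv)
    mu_u = sum(ru.values()) / len(ru)
    mu_v = sum(rv.values()) / len(rv)
    du = {i: ru.get(i, mu_u) - mu_u for i in union}   # 0 when u did not rate i
    dv = {i: rv.get(i, mu_v) - mu_v for i in union}
    num = sum(du[i] * dv[i] for i in union)
    den = sqrt(sum(x ** 2 for x in du.values())) * \
          sqrt(sum(x ** 2 for x in dv.values()))
    return num / den if den else 0.0
```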

On the other hand, the dataset extracted from Twitter was built by monitoring the spreading processes on Twitter before, during and after the announcement of the discovery of the Higgs boson. For this reason, we will refer to this dataset as the Higgs dataset. The dataset is composed of the messages posted on Twitter about this discovery between 1 and 7 July 2012. To build the action log, we considered tweets and retweets as actions. Thus, if a user v posts a tweet t, and later on a user u retweets it, we consider t as an action propagated from v to u. The Higgs dataset is composed of 456,626 nodes; 14,855,842 edges; and 396,356 logged actions.

4.2 Procedure

We ran experiments by randomly selecting 1400 target users from each network. Then, for each target, we applied the first step of ILP and obtained the set \(I_{\textit{Target}}\) with \(k=20\) and \(mininf=1.01\) (these values were also used to configure Step 3 of ILP). In Flixster, the average size of the \(I_{\textit{Target}}\) sets processed was 7.2. For this reason, we applied a cross-validation technique with 4 folds. We decided to use 4 folds because, for k-fold cross-validation, each user needs at least k influential nodes to become a valid target, and the average number of influential nodes over all nodes in the Flixster graph was 3.2. Thus, we discarded those users with fewer than 4 influential nodes during the target selection process. The same configuration was used for the Higgs dataset.

The cross-validation process consisted of picking a target, hiding each fold \(F_{i}\), one at a time, and running the rest of the steps of ILP on the remaining 3 folds (i.e., \(I_{\textit{Target}}-F_{i}\)). Notice that the cross-validation process was carried out over \(I_{\textit{Target}}\), the set of influential nodes, since our goal is to recommend links to influential users. Finally, we compared the results obtained, \({\textit{New}}_{{\textit{NI}}}\), with the hidden fold and computed the precision and recall measures using Eqs. 3 and 4 with \({\textit{New}}={\textit{New}}_{\textit{NI}}\), respectively. Moreover, we computed the Normalized Discounted Cumulative Gain (nDCG) measure (Wang et al. 2013) using Eq. 5. In Eq. 5, \({\textit{DCG}}\) is computed using Eq. 6, where \(rel_{i}=1\) if the link in position i of \({\textit{New}}\) was in the hidden fold and \(rel_{i}=0\) otherwise. In addition, \({\textit{IDCG}}\) (ideal DCG) represents the DCG of the perfect ranking. nDCG is a normalized measure of ranking quality. The premise of nDCG is that relevant links appearing lower in the result ranking should be penalized, as the graded relevance value is reduced logarithmically proportional to the position of the result. A short code sketch of these measures is given after the equations.

$$\begin{aligned}&{\textit{precision}}_{{\textit{New}}}=\frac{\left| \left\{ x\,|\,x\in F_{i}\,\wedge x\in {\textit{New}}\right\} \right| }{\left| {\textit{New}}\right| } \end{aligned}$$
(3)
$$\begin{aligned}&\quad {\textit{recall}}_{{\textit{New}}}=\frac{\left| \left\{ x\,|\,x\in F_{i}\,\wedge x\in {\textit{New}}\right\} \right| }{\left| F_{i}\right| } \end{aligned}$$
(4)
$$\begin{aligned}&\quad {\textit{nDCG}}_{\textit{New}}=\frac{\textit{DCG}_{\textit{New}}}{\textit{IDCG}} \end{aligned}$$
(5)
$$\begin{aligned}&\quad {\textit{DCG}}_{\textit{New}}=\sum _{i=1}^{\left| {\textit{New}}\right| }\frac{2^{rel_{i}}-1}{\textit{log}_{2}(i+1)} \end{aligned}$$
(6)
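The following sketch shows how these measures can be computed for a single target and hidden fold, following Eqs. 3–6 with binary relevance.

```python
from math import log2
from typing import List, Set, Tuple

def precision_recall(new: List[str], hidden: Set[str]) -> Tuple[float, float]:
    """Eqs. 3 and 4: hits over the prediction list and over the hidden fold."""
    hits = sum(1 for x in new if x in hidden)
    precision = hits / len(new) if new else 0.0
    recall = hits / len(hidden) if hidden else 0.0
    return precision, recall

def ndcg(new: List[str], hidden: Set[str]) -> float:
    """Eqs. 5 and 6 with binary relevance: rel_i = 1 iff the i-th link is in the fold."""
    dcg = sum((2 ** (1 if x in hidden else 0) - 1) / log2(i + 2)   # i is 0-based here
              for i, x in enumerate(new))
    ideal_hits = min(len(new), len(hidden))
    idcg = sum(1.0 / log2(i + 2) for i in range(ideal_hits))       # perfect ranking
    return dcg / idcg if idcg else 0.0
```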

Furthermore, we ran predictions using the following state-of-the-art topological metrics for link prediction (Wang et al. 2015), where \(\varGamma (x)\) is the set of neighbors of node x and \(\left| \varGamma (x)\right| \) is the number of neighbors of node x (a code sketch covering these metrics is given after the list).

  • Common Neighbor \(({\textit{CN}})\) this metric is one of the most widely used measures in link prediction due to its simplicity. CN is defined as the number of nodes that both x and y have a direct interaction with (Eq. 7).

    $$\begin{aligned} {\textit{CN}}(x,y)=\left| \varGamma (x)\cap \varGamma (y)\right| \end{aligned}$$
    (7)
  • Jaccard Coefficient \(({\textit{JC}})\) this coefficient normalizes the size of the common neighborhood by the total number of neighbors that x and y have (Eq. 8).

    $$\begin{aligned} {\textit{JC}}(x,y)=\frac{\left| \varGamma (x)\cap \varGamma (y)\right| }{\left| \varGamma (x)\cup \varGamma (y)\right| } \end{aligned}$$
    (8)
  • Sørensen Index \(({\textit{SI}})\) in addition to taking into account the size of the common neighborhood, it assumes that nodes with lower degrees have a higher link likelihood (Eq. 9).

    $$\begin{aligned} {\textit{SI}}(x,y)=\frac{\left| \varGamma (x)\cap \varGamma (y)\right| }{\left| \varGamma (x)\right| +\left| \varGamma (y)\right| } \end{aligned}$$
    (9)
  • Salton Cosine Similarity\(({\textit{SC}})\) this metric is a common cosine metric for measuring the similarity between two nodes (Eq. 10).

    $$\begin{aligned} {\textit{SC}}(x,y)=\frac{\left| \varGamma (x)\cap \varGamma (y)\right| }{\sqrt{\left| \varGamma (x)\right| \cdot \left| \varGamma (y)\right| }} \end{aligned}$$
    (10)
  • Hub Promoted \(({\textit{HP}})\) it defines the topological overlap of nodes x and y. The HP value is determined by the lower of the two node degrees (Eq. 11).

    $$\begin{aligned} {\textit{HP}}(x,y)=\frac{\left| \varGamma (x)\cap \varGamma (y)\right| }{min(\left| \varGamma (x)\right| ,\left| \varGamma (y)\right| )} \end{aligned}$$
    (11)
  • Hub Depressed \(({\textit{HD}})\) this metric is similar to HP, but the value is determined by the higher of the two node degrees (Eq. 12).

    $$\begin{aligned} {\textit{HD}}(x,y)=\frac{\left| \varGamma (x)\cap \varGamma (y)\right| }{max(\left| \varGamma (x)\right| ,\left| \varGamma (y)\right| )} \end{aligned}$$
    (12)
  • Leicht–Holme–Newman \(({\textit{LHN}})\) this metric assigns high similarity to node pairs that have many common neighbors compared not to the possible maximum, but to the expected number of such neighbors (Eq. 13).

    $$\begin{aligned} {\textit{LHN}}(x,y)=\frac{\left| \varGamma (x)\cap \varGamma (y)\right| }{\left| \varGamma (x)\right| \cdot \left| \varGamma (y)\right| } \end{aligned}$$
    (13)
  • Adamic–Adar Coefficient \(({\textit{AA}})\) this coefficient was initially proposed for computing the similarity between two web pages. In this metric, common neighbors that have fewer neighbors are weighted more heavily (Eq. 14).

    $$\begin{aligned} {\textit{AA}}(x,y)=\sum _{z\in \varGamma (x)\cap \varGamma (y)}\frac{1}{log\left| \varGamma (z)\right| } \end{aligned}$$
    (14)
  • Preferential Attachment \(({\textit{PA}})\) it indicates that new links are more likely to connect higher-degree nodes than lower-degree ones (Eq. 15).

    $$\begin{aligned} {\textit{PA}}(x,y)=\left| \varGamma (x)\right| \cdot \left| \varGamma (y)\right| \end{aligned}$$
    (15)
  • Resource Allocation \(({\textit{RA}})\) the RA metric is similar to AA. Both metrics suppress the contribution of high-degree common neighbors. However, RA punishes high-degree common neighbors more heavily than AA (Eq. 16).

    $$\begin{aligned} {\textit{RA}}(x,y)=\sum _{z\in \varGamma (x)\cap \varGamma (y)}\frac{1}{\left| \varGamma (z)\right| } \end{aligned}$$
    (16)
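Since all these neighbor-based metrics reduce to simple set operations over the neighbor sets, they can be sketched together as follows; the function and variable names are illustrative.

```python
from math import log, sqrt
from typing import Dict, Set

def neighbor_metrics(gamma: Dict[str, Set[str]], x: str, y: str) -> Dict[str, float]:
    """Eqs. 7-16 for a pair (x, y); gamma maps each node to its neighbor set."""
    gx, gy = gamma[x], gamma[y]
    common = gx & gy
    cn = float(len(common))
    union = len(gx | gy)
    return {
        "CN": cn,
        "JC": cn / union if union else 0.0,
        "SI": cn / (len(gx) + len(gy)) if (gx or gy) else 0.0,
        "SC": cn / sqrt(len(gx) * len(gy)) if gx and gy else 0.0,
        "HP": cn / min(len(gx), len(gy)) if gx and gy else 0.0,
        "HD": cn / max(len(gx), len(gy)) if (gx or gy) else 0.0,
        "LHN": cn / (len(gx) * len(gy)) if gx and gy else 0.0,
        # Skip degree-1 common neighbors to avoid dividing by log(1) = 0.
        "AA": sum(1.0 / log(len(gamma[z])) for z in common if len(gamma[z]) > 1),
        "PA": float(len(gx) * len(gy)),
        "RA": sum(1.0 / len(gamma[z]) for z in common if gamma[z]),
    }

# Toy usage on a small undirected graph given as neighbor sets.
if __name__ == "__main__":
    g = {"x": {"a", "b", "c"}, "y": {"b", "c", "d"},
         "a": {"x"}, "b": {"x", "y"}, "c": {"x", "y"}, "d": {"y"}}
    print(neighbor_metrics(g, "x", "y"))
```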

Additionally, we ran predictions using a learning-based approach (LB). This approach consisted of a Logistic Regression-based classifier (built with the Weka framework). For each target T and fold i, we trained a classifier whose input consisted of instances of the form \(\{{\textit{CN}}\left( T,y\right) ,\,{\textit{JC}}\left( T,y\right) ,\,{\textit{SI}}\left( T,y\right) ,\,{\textit{SC}}\left( T,y\right) ,\,{\textit{HP}}\left( T,y\right) ,\,{\textit{HD}}\left( T,y\right) ,\,{\textit{LHN}}\left( T,y\right) ,\,{\textit{AA}}\left( T,y\right) ,\,{\textit{PA}}\left( T,y\right) ,\,{\textit{RA}}\left( T,y\right) ,\,class\}\), where \(y\in (\varGamma (T)-F_{i})\,\cup \,{\textit{NL}}_{\textit{training}}\), with \({\textit{NL}}_{\textit{training}}\subset \{nl\mid nl\notin \varGamma (T)\}\) randomly selected and \(\left| \varGamma (T)-F_{i}\right| =\left| {\textit{NL}}_{\textit{training}}\right| \) (to keep the training set balanced), and class was LINK if \(y\in (\varGamma (T)-F_{i})\) or NO-LINK if \(y\in {\textit{NL}}_{\textit{training}}\). Once the classifier was trained, we tested it using a set of instances for which \(y\in F_{i}\cup (\{nl\mid nl\notin \varGamma (T)\}-{\textit{NL}}_{\textit{training}})\). Due to execution time limitations, we reduced the number of no-links in the testing set to 20,000 (randomly selected). Notice that this reduction benefited the performance of the learning-based approach.
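The setup of this baseline can be sketched as follows. The sketch uses scikit-learn's LogisticRegression instead of Weka, and both the feature function (for example, the ten topological metrics above) and the pool of non-neighbors are injected as parameters, so it illustrates the general procedure rather than the exact configuration used in our experiments.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

from sklearn.linear_model import LogisticRegression

def train_lb_classifier(target: str,
                        pos_neighbors: Set[str],        # Gamma(T) - F_i
                        non_neighbors: Sequence[str],   # nodes not in Gamma(T)
                        features: Callable[[str, str], List[float]],
                        seed: int = 0) -> Tuple[LogisticRegression, Set[str]]:
    """Train a balanced LINK / NO-LINK classifier for one target and fold.

    Negative examples (NL_training) are sampled so that the training set
    stays balanced; assumes enough non-neighbors to sample from.
    """
    rng = random.Random(seed)
    nl_training = set(rng.sample(list(non_neighbors), len(pos_neighbors)))
    X, y = [], []
    for node in pos_neighbors:
        X.append(features(target, node)); y.append(1)    # LINK
    for node in nl_training:
        X.append(features(target, node)); y.append(0)    # NO-LINK
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf, nl_training

def rank_candidates(clf: LogisticRegression,
                    target: str,
                    candidates: Sequence[str],
                    features: Callable[[str, str], List[float]]) -> List[str]:
    """Rank test candidates by their predicted probability of the LINK class."""
    probs = clf.predict_proba([features(target, c) for c in candidates])[:, 1]
    order = sorted(range(len(candidates)), key=lambda i: probs[i], reverse=True)
    return [candidates[i] for i in order]
```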

Moreover, for each baseline \(b\in \{{\textit{CN}},\,{\textit{JC}},\,{\textit{SI}},\,{\textit{SC}},\,{\textit{HP}},\,{\textit{HD}},\,{\textit{LHN}},\,{\textit{AA}},\,{\textit{PA}},\,{\textit{RA}},\,{\textit{LB}}\}\), we built two rankings by sorting, in descending order, either the value of the metric or the likelihood of each testing instance belonging to the LINK class:

  • Ranking \({\textit{New}}_{I}^{b}\) with \(\varGamma (x)=I_{\textit{Target}}-F_{i}\), that is, using the same set of nodes used by our approach.

  • Ranking \({\textit{New}}_{\textit{All}}^{b}\) with \(\varGamma (x)=\varGamma ({\textit{Target}})-F_{i}\), that is, using all the neighbors of Target without the hidden nodes.

From each ranking, we took the 20 top-ranked nodes as recommendations to the target users. Finally, we also computed precision, recall and nDCG for each ranking using Eqs. 3, 4 and 5 with \({\textit{New}}={\textit{New}}_{I}^{b}\) and \({\textit{New}}={\textit{New}}_{\textit{All}}^{b}\).

In addition, we computed the area under the receiver operating characteristic curve (AUC), since it is a standard metric used to quantify the accuracy of different link prediction methods (Ding et al. 2016; Lü and Zhou 2011; Dai et al. 2017). This metric can be interpreted as the probability that a randomly chosen missing link is given a higher score than a randomly chosen nonexistent link (Lü and Zhou 2011). Among n independent comparisons, if there are \(n^\prime \) occurrences of the missing link having a higher score and \(n^{\prime \prime }\) occurrences of the missing link and the nonexistent link having the same score, we define the accuracy as \({\textit{AUC}}=(n^\prime +0.5n^{\prime \prime })/n\). Then, if all the scores are generated from an independent and identical distribution, the accuracy should be about 0.5. Therefore, the degree to which the accuracy exceeds 0.5 indicates how much better the algorithm performs than pure chance (Lü and Zhou 2011).
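The sampled version of the AUC described above can be sketched as follows, drawing n random comparisons between the score of a missing link and the score of a nonexistent link.

```python
import random
from typing import Callable, Sequence, Tuple

Edge = Tuple[str, str]

def sampled_auc(missing_links: Sequence[Edge],
                nonexistent_links: Sequence[Edge],
                score: Callable[[Edge], float],
                n: int = 10000,
                seed: int = 0) -> float:
    """AUC = (n' + 0.5 * n'') / n over n random missing vs. nonexistent comparisons."""
    rng = random.Random(seed)
    higher = ties = 0
    for _ in range(n):
        s_missing = score(rng.choice(missing_links))      # hidden but real link
        s_absent = score(rng.choice(nonexistent_links))   # link that never appears
        if s_missing > s_absent:
            higher += 1
        elif s_missing == s_absent:
            ties += 1
    return (higher + 0.5 * ties) / n
```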

4.3 Results

4.3.1 Flixster dataset

Table 2 shows the evolution of the precision, recall, nDCG and AUC measures as the number of predictions increases (from 5 to 20). As we can see, ILP improved on the baselines in all the scenarios. The best precision was obtained by ILP with 5 predictions (3.2%), whereas the best recall was obtained with 20 predictions (15.5%). We found a significant improvement in the precision and recall of ILP with respect to the best topological metrics (CN, JC and AA): 28 and 20.15%, respectively (\(p<0.05\)). The learning-based approach did not improve on the results obtained by ILP, even though the number of no-links in the testing set was reduced.

On the other hand, we can observe that the topological metrics obtained better results generating \({\textit{New}}_{I}\) than \({\textit{New}}_{\textit{All}}\). For example, the best precision for \({\textit{New}}_{I}\) was obtained using the CN, JC and AA metrics (2.5%). In contrast, for \({\textit{New}}_{\textit{All}}\), the best precision was 1.7%. This represents a significant difference of 47.06% between \({\textit{New}}_{I}\) and \({\textit{New}}_{\textit{All}}\) (\(p<0.05\)). Something similar occurred with the recall: for \({\textit{New}}_{I}\), the best value was 12.9%, obtained using AA with 20 predictions, whereas the best value for \({\textit{New}}_{\textit{All}}\) was 11.7%, also using AA (a difference of 10.25% with \(p<0.05\)).

Table 2 Comparison of precision (P), recall (R), nDCG (N) and AUC for the Flixster dataset

Regarding nDCG, ILP obtained a value of 0.097 in contrast to the 0.084 obtained by LB. Moreover, the best nDCG value for the topological metrics was 0.082, also obtained for \({\textit{New}}_{I}\) using CN, whereas the best value for \({\textit{New}}_{\textit{All}}\) was 0.06, obtained using AA (a significant difference of 36%). It is worth noticing that, contrary to what happens with the topological metrics, the learning-based approach presented a worse performance using \({\textit{New}}_{I}^{\textit{LB}}\) than \({\textit{New}}_{\textit{All}}^{\textit{LB}}\). This is because the training set is smaller, since \(|I_{\textit{Target}}|<|\varGamma ({\textit{Target}})|\), which negatively affects the performance of the classifier.

Fig. 3 Comparison of AUC measure between ILP and baselines generating \({\textit{New}}_{I}\) (Flixster dataset)

Figure 3 shows a comparison of the AUC measure obtained by ILP and the AUC measures obtained by the baselines building \({\textit{New}}_{I}\). The X-axis represents the number of recommendations (links predicted) and the Y-axis represents the AUC measure. As we can see, ILP obtained a better AUC measure when the number of recommendations was less than or equal to 100. With more than 100 recommendations, the AUC obtained with ILP increased only slightly and was surpassed by some of the topological metrics (CN, JC, AA and RA). This happens because the total number of recommendations that ILP can make is smaller than the total number of recommendations that the topological metrics are able to make. In fact, on average only 40.63 new connections found with ILP had a marginal gain higher than 1.01. However, we think that this is not a limitation of ILP, since when recommending new connections to users of a social network, recommendation lists tend to be short (frequently fewer than 20 items) in order to help users focus on the most relevant results.

Fig. 4 Comparison of AUC measure between ILP and baselines generating \({\textit{New}}_{\textit{All}}\) (Flixster dataset)

Figure 4 compares the AUC measure obtained by ILP with the state-of-the-art topological metrics and the learning-based approach generating \({\textit{New}}_{\textit{All}}\). As occurred with \({\textit{New}}_{I}\), ILP showed a better AUC value when the number of links predicted was lower than 20. In this case, the AUC measures of the AA and RA metrics slightly surpassed ILP when \(n=100\). In contrast, the learning-based approach surpassed the ILP results for more than 20 predictions. However, it is important to remark that, due to execution time limitations, the testing set used to test LB was significantly reduced. This reduction clearly improved the performance of the LB approach. For this reason, we think that its AUC would be worse with the full testing set.

Finally, Fig. 5 shows a comparison between the topological metrics building \({\textit{New}}_{I}\) and \({\textit{New}}_{\textit{All}}\). As we can see, the approach using \({\textit{New}}_{I}\) obtained better results than the one using \({\textit{New}}_{\textit{All}}\) for the metrics CN, JC, HP, AA and RA when \(n\le 20\). However, when \(n>20\), the AUC measure obtained using \({\textit{New}}_{\textit{All}}\) outperformed the AUC measure obtained using \({\textit{New}}_{I}\). As with ILP, this happens because the maximum number of link predictions that the metrics can make for \({\textit{New}}_{I}\) is smaller than for \({\textit{New}}_{\textit{All}}\). In turn, this is because \(I_{\textit{Target}}-F_{i}\) (used to generate \({\textit{New}}_{I}\)) is smaller than \(\varGamma ({\textit{Target}})-F_{i}\) (used to generate \({\textit{New}}_{\textit{All}}\)). Nevertheless, \({\textit{New}}_{I}\) for the HP metric was not affected, since the HP value is determined by the lower of the node degrees (Wang et al. 2015). In contrast, HD obtained better results using \({\textit{New}}_{\textit{All}}\), since the HD value is determined by the higher of the node degrees. The rest of the metrics showed a worse performance generating \({\textit{New}}_{I}\).

Fig. 5 Comparison of AUC measure between topological metrics using \({\textit{New}}_{I}\) and \({\textit{New}}_{\textit{All}}\) (Flixster dataset)

4.3.2 Higgs dataset

The results obtained with the Higgs dataset were more promising than those obtained with Flixster. Table 3 shows the precision, recall, nDCG and AUC measures for 5, 10, 15 and 20 predictions. Similar to the Flixster dataset, ILP obtained the best performance. The best precision was 6.9%, obtained by ILP with 5 predictions (an improvement of 97.14% compared to the result obtained by \({\textit{New}}_{I}^{\textit{LB}}\), also with 5 predictions), while the best recall was 35.8%, obtained by ILP with 20 predictions (an improvement of 47.93% compared to the result obtained by \({\textit{New}}_{I}^{\textit{LB}}\), also with 20 predictions).

Table 3 Comparison of precision (P), recall (R), nDCG (N) and AUC for the Higgs dataset

Comparing the results obtained by the topological metrics with \({\textit{New}}_{I}\) and \({\textit{New}}_{\textit{All}}\), we observed the same patterns as with Flixster; that is, the best results were obtained with \({\textit{New}}_{I}\). However, contrary to what happened with the Flixster dataset, the learning-based approach also presented a better performance generating \({\textit{New}}_{I}\) than \({\textit{New}}_{\textit{All}}\). We think that this is also related to the better results obtained by ILP with the Higgs dataset.

Figures 6 and 7 compare the AUC measure obtained by ILP with the baseline approaches generating \({\textit{New}}_{I}\) and \({\textit{New}}_{\textit{All}}\), respectively. These figures also show that the best results were obtained with ILP, particularly when the number of predictions was lower than 150. As in the experiments with Flixster, the AUC obtained by ILP grew rapidly with few predictions, but then its growth slowed.

Fig. 6 Comparison of AUC measure between ILP and baselines generating \({\textit{New}}_{I}\) (Higgs dataset)

Fig. 7 Comparison of AUC measure between ILP and baselines generating \({\textit{New}}_{\textit{All}}\) (Higgs dataset)

Finally, Fig. 8 shows a comparison of the AUC metric obtained by the topological metrics generating \({\textit{New}}_{I}\) and \({\textit{New}}_{\textit{All}}\). These results were similar to those obtained with the Flixster dataset. However, we can observe a more significant difference between the AUCs where the curve generated with \({\textit{New}}_{I}\) surpasses the curve generated with \({\textit{New}}_{\textit{All}}\), as is the case for CN, JC, AA and RA.

Fig. 8 Comparison of AUC measure between topological metrics using \({\textit{New}}_{I}\) and \({\textit{New}}_{\textit{All}}\) (Higgs dataset)

5 Conclusions and future work

In short, we highlight two main contributions of this work. First, we presented a new approach to predict links to influential users. Second, we presented an experimental analysis showing that topological metrics perform better at predicting links to influential users when they are applied over the set of current influential users of the target. We found that with ILP we can improve the link prediction performance with respect to classical topological metrics and learning-based approaches. On the other hand, one of the limitations of our approach is that the target must have influential users in his/her current neighborhood to be able to receive recommendations of other influential links. Moreover, as shown by the AUC metric, ILP predicts a limited number of new links. For this reason, our approach showed a better performance than the topological metrics when the number of recommendations was lower than \(\sim 100\), but when this number increases our approach is surpassed by the classical approaches. We also observed some differences between the datasets used for the experimentation. We think that these differences are related to the role that social influence plays in each social network. Thus, if social influence among the users of the social network is strong, our approach will obtain better results.

Future work will focus on enriching ILP with content-based information in order to recommend links to influential users in specific topics or domains.