Introduction

Interactions among individuals shape how our societies unfold and the graph of such interactions can reveal a lot about our social organisation and its evolution in time. That is why social networks have attracted a great deal of attention to understand the mechanisms underlying their evolution and provide valuable information on the microscopic determinants of social dynamics, for instance, individuals’ search strategies1,2 or the schemes to allocate time in socially charged activities3,4.

The evolution of social networks is shaped by the interplay of complex mechanisms operating at different scales. Indeed, individuals have a heterogeneous propensity to engage in social interactions, featuring heavy-tailed distributions of activity and degree5. Also, people allocate their social interactions toward similar alters6,7,8, for instance connecting to a friend of a friend-triadic closure9. At the same time, individuals may seek novel connections outside of their inner circle of contacts, based on shared interests or experiences (focal closure)4,10,11,12,13. Moreover, social networks are intrinsically dynamical systems that evolve in time14,15 as links between nodes are continuously created and destroyed16,17,18. This time-varying nature of the networks deeply affects not only their topological properties7,15,19 but also the dynamical processes unfolding on their evolving topology20,21,22,23.

The growing availability of large scale and longitudinal datasets logging human interactions allowed for the study and characterisation of the birth and evolution of social networks. This, in turn, triggered the introduction of models capturing some relevant aspects of the whole phenomenology, such as the propensity of individuals to engage in social interactions5, the correlations in the nodes’ activity patterns21,24,25, the emergence of topological correlations9,26,27, and the clustering of nodes in tightly connected communities11,28,29.

All of the models proposed so far, however, feature two main drawbacks. First, most of these models are growing models of a network, and thus they do not account for the dynamics of the network itself. Indeed, the growth is usually simulated by inserting one node per evolution step that establishes edges either by copying neighbours of a randomly selected nodes13, by rewiring existing connections30,31 or by following a topological9,26,27, information-based32 or hierarchical33 preferential attachment. Moreover, all of these models return binary networks where no weight is assigned to edges. An exception is given in34 where, however, authors rely on an initial community structure and degree distribution to reproduce the final modularity of the networks.

The second limitation applies to models describing the network dynamics, allowing nodes to interact more than once in time. Some of them require to fix some data-driven heterogeneous distribution to reproduce the real-world heterogeneity of given observables, such as the fitness associated with each node9,26 or the propensity for a node to engage in social interactions28. Other models focus on particular aspects of the network evolution, such as the drivers of the users’ activation patterns in time35 or the strategies that a node follows to select the interacting partner36.

In this work, we propose to solve these issues leveraging on the notion of adjacent possible space37,38,39. Introduced by the biologist Stuart Kauffman in the framework of molecular and biological evolution, the adjacent possible framework posits that space (be it of words, ideas or products) being explored by some agents is partitioned in three regions: (i) the actual, accounting for all the tokens that users already discovered and experienced, (ii) the adjacent possible space, encompassing all the concepts and tokens that are just one step away from what is known and could become actual in the immediate future, and, (iii) non-adjacent possible space, the set of all those things that could become possible at some later stages, conditional to a suitable expansion of the actual. The key ingredient of this framework is that once an individual experiences something new from his adjacent possible space, the new item gets immediately surrounded by fresh concepts previously belonging to the non-adjacent possible space that is now part of the adjacent possible space. Recently, some of us proposed a mathematical formalisation of the notion of the adjacent possible space40,41 where the space of tokens that can be explored by an agent (represented, for instance, as a Polya’s urn42,43) grows when he experiences a novelty, i.e., an item of the space being explored never seen before. This modelling framework allows making quantitative predictions of key statistical features of innovation processes in human activities44 and technological progress45.

Here, we apply the adjacent possible framework to describe and reproduce the dynamics of social interactions via a model relying on a minimal set of microscopic rules, i.e., defined at the individual level. We start from the intuition that an exploratory process drives the growth of a social network, as individuals expand their circle of acquaintances by exploring a space of social connections. From this perspective, individuals expand their potential network of contacts—i.e., their adjacent possible space—every time they create a new connection.

We show that our model of network growth and dynamics can correctly reproduce many features of social networks at different scales. For instance, the heterogeneous propensity of nodes to engage social interactions (activity), their degree distribution, the way a node decides to allocate a social interaction toward new or already contacted alters (i.e., the individuals’ propensity to innovate and create new connections), the dynamical growth of average degree, the local clustering coefficient, the modularity of a network and the overall rate of growth of the number of edges in it.

Results and discussion

Real-world social network datasets

We test the predictions of our theory against three different real-world social networks that have been extensively studied and characterised in the previous works21,25,28,46 and that represent a good proxy to dynamically measure interactions between people (see “Methods” and Supplementary Note 2 for details): (i) the American physical society (APS) co-authorship network, (ii) the Twitter mention network (TMN), and (iii) the mobile phone network (MPN).

The most renowned feature of these systems is that both the propensity of a user to engage in a social interaction (i.e., the activity ai of a node i defined as the number of events actively engaged by node i) and the degree ki (i.e., the number of different neighbours connected to node i) are broadly distributed. The tails of their distributions are usually approximated with a power-law, i.e.,

$$P(a)\propto {a}^{-(\eta +1)}\quad \,\text{and}\,\quad P(k)\propto {k}^{-\mu },$$

as shown in Fig. 1a, b.

Fig. 1: Stylised facts in social networks’ evolution: empirical data and model predictions.
figure 1

a The P(a) activity distribution, (b) the P(k) degree distribution, and, (c) the temporal growth of the total number of edges A(t) tγ as found in the American physical society (APS, red squares), mobile phone network (MPN, green triangles) and Twitter mention network (TMN, blue circles) datasets. In each panel, we also show the same curves as found in the best fitting model of each dataset (solid lines with the same colour as the corresponding dataset). The rescaled strengthening probability \({p}_{e}(k)={(1+k/{c}_{e})}^{-\beta }\), being ce the memory constant for nodes of class e, as measured for different classes of nodes (symbols, colour depth proportional to the number of agents in a node class e) is reported in panel (d) for the APS dataset, in (e) for the MPN case, and, in (f) for the TMN system. In the main panels, we compare the empirical curves (coloured symbols) with the pe(k) found in the corresponding best-fitting urn model (black symbols) and the theoretical guideline \({p}_{e}{({x}_{e})}^{1/\beta }={(1+{x}_{e})}^{-1}\) (black solid lines), being xe the rescaled degree xe = k/ce. In the insets, we show the average rescaled \({\left\langle {p}_{e}({x}_{e})\right\rangle }_{e}\) for the empirical (blue lines) and synthetic data (orange lines) as well as the theoretical behaviour p(x) = (1 + x)−1 (black line). We fixed ce by fitting the average pe(k) as measured for all the nodes belonging to the class e. gi The average degree \(\langle k({t}_{e},t)\rangle \propto {(t/{t}_{e})}^{q}\) as a function of the rescaled global time ge = t/te for different classes of nodes entering the system at different times te (coloured symbols) for the APS (g), MPN (h), and TMN system (i), respectively. In the insets, we show the corresponding results for the urns model. We also show the best fit \(\left\langle k(t)\right\rangle \propto {t}^{q}\) for all the cases (solid blue lines). Note that here the exponent q does not depend on class e.

These systems are also expanding in time as new nodes and edges keep entering the network. In Fig. 1c we show the growth in intrinsic time t (i.e., the number of recorded events) of the number of edges A(t) in the systems that follows a Heaps’ law as

$$A(t)\propto {t}^{\gamma },$$

where γ is the exponent leading the growth of the number of links (see Supplementary Note 3 for details).

Another feature is that individuals display correlations on their activity. When a node engages a social interaction, it is likely to turn its social activity (e.g., a mention in the TMN) toward a node already contacted in the past rather than toward a randomly selected node in the system. A possible way to quantify this mechanism is to measure the probability pi(k → k + 1) (in short pi(k)) for a node i that already contacted k different nodes to contact a new one the next time it will be active21,25. The pi(k), which is formally the probability to pass from degree k → k + 1, was found to feature the same functional form

$${p}_{i}(k)={\left(1+\frac{k}{{c}_{i}}\right)}^{-\beta }$$

across all the analysed datasets25, with a single value of β -the strengthening exponent- and a distributed, agent-depending strengthening constants ci (see Supplementary Note 3 for details). At odds with the strengthening exponent β, ci significantly varies across individuals. To account for the nodes’ dynamic variability, we grouped individuals in different classes, e, accordingly to their entrance time te into the system and their final degree ke (see the “Methods” for details). We show in Fig. 1d–f both the rescaled pe(xe), with xe = k/ce for each class e, (main panels) and the average value of the rescaled probability \({\left\langle {p}_{e}({x}_{e})\right\rangle }_{e}\) (insets), as found in the empirical and synthetic data (see Supplementary Note 3 for details).

This correlation mechanism inhibits the creation of new links, resulting in a sub-linear growth of the average degree

$$\left\langle k({t}_{e},t)\right\rangle \propto {(t/{t}_{e})}^{q},$$

for the e-th class of nodes, where t is the global intrinsic time measured as the number of total events recorded and where q < 1, as shown in Fig. 1g–i.

In Fig. 2 we show for the first time the analysis of Taylor’s law47,48 on social networks evolution data. Recently pointed out as a shared feature in evolving systems, Taylor’s law relates the standard deviation of a random variable to its mean and it measures the fluctuations in innovation rate49,50. The fingerprint of a complex dynamics is characterised by

$$\sigma (t)\propto \mu {(t)}^{\delta }$$

with δ ≥ 1, at odds with δ = 1/2 characterising uncorrelated events. Complex behaviour of Taylor’s law does not trivially follow from Zipf’s and Heaps’ laws49,50, making it a relevant observable to test theoretical predictions. We here measure Taylor’s law referred to the growth of the individuals’ connections in the network, observing a linear or superlinear behaviour (Fig. 2).

Fig. 2: Taylor’s law in social networks.
figure 2

ac Taylor’s law for the number of links created by different individuals at each user’s intrinsic time u, being u the number of events where a user took part. We defined \(\mu (k)\equiv {\left\langle {k}_{i}(u)\right\rangle }_{i}\) and \(\sigma (k)\equiv \sqrt{\left\langle {k}_{i}^{2}(u)\right\rangle -{\left\langle {k}_{i}(u)\right\rangle }^{2}}\), where ki(u) is the degree of node i at its intrinsic time u and where we implicitly have a dependence on the node i’s intrinsic time u. Results in the empirical datasets, contrasted with model predictions, for (a) the American physical society dataset (APS), (b) the mobile phone network (MPN) dataset, and (c) the Twitter mention network (TMN) dataset. In (a) we also report results for the 1-link sampled database, defined in the “Methods” section. For all the empirical and synthetic datasets we find an approximate power-law behaviour σ(k) ~ μ(k)δ, with δ 1. For each curve, we considered the 2000 individuals with the longest history (final intrinsic time). df We report the effective number of points over which the mean and the standard deviation is computed at each time t. This number decreases for the highest values of the mean, corresponding to higher values of intrinsic time since the number of the total performed actions varies from individual to individual. The number of samples used is reported, both for the empirical and synthetic datasets, for (d) the APS dataset, (e) the MPN dataset, and (f) the TMN dataset. In all the cases, we considered only averages on at least 100 samples, but for the MPN case, where we additionally show values for times where at least 10 samples were available (model 10 curves).

Besides these fundamental features, we also track a set of local and global observables, later presented in the Results section.

Model

As said, we build on the adjacent possible framework40 to model the exploration of social spaces where individuals are embedded. Within this framework, we microscopically model how the space of possibilities of a node (i.e., the set of all the social interactions that are “possible” for a node) evolves in time. This space, at a given point in time, consists of three distinct regions: (i) the actual, including all the links already experienced by the individuals in the past (current connections), (ii) the adjacent possible space accounting for all the links that are just one step away from being explored (e.g., the friends of friends not yet our friends), and, (iii) the non-adjacent possible space, accounting for all the links that may become adjacent and possible at some later stage.

A second essential ingredient of our model is the presence of correlated novelties40. Every time the social exploration process of a node i activates a new connection with a node j belonging to its adjacent possible space, i and j experience a novelty, i.e., the link eij gets active for the first time. In this way, j becomes now part of the actual region of i and the adjacent possible space of i enlarges with new possible connections that were not possible for i before. In other words, a novelty paves the way to another in the future.

The modelling scheme we apply is a multi-agent version of a modified Polya’s urn42,43 that has already been successfully implemented to reproduce the key statistical properties of complex systems (Zipf’s, Heaps’ and Taylor’s law) and their dynamical correlations40,44,50,51. In the simpler formulation of that model40, the key ingredient is an urn, \({\mathcal{U}}\), initially containing N0 distinct elements. One may think of them as balls of different colours representing an item of the space being explored. The dynamics proceed by repeatedly withdraw balls from \({\mathcal{U}}\) and annotating them in a temporal sequence of events \({\mathcal{S}}\)-here this sequence represents a sequence of social contacts experienced by a user. Every time we pick up a ball, we put it back in the urn together with ρ additional copies of it, thereby reinforcing that element’s likelihood of being drawn again in the future, in a “rich-get-richer” fashion. To account for the adjacent possible space expansion, whenever a novel (never extracted before) element appears in the sequence \({\mathcal{S}}\), we additionally put ν + 1 new distinct elements in \({\mathcal{U}}\), thus expanding the adjacent possible space of the system.

We generalise this model to a multi-agent definition to account for the birth, evolution and dynamics of social networks. To this end, we introduce two key concepts. First, the system consists of a collection of urns, each identified by a unique alphanumeric ID (a, b, c,…), representing users in a social network. Second, each ball within each urn bears the reference ID of another urn in the system. Then, the sequence of extracted balls will correspond to a series of social contacts annotated as tuples (i, j), where i is the ID of the urn drawing a ball, and j is the ID of the drawn ball. For each extraction, the reinforcement process requires to put back ρ copies of the extracted ball j into the extracting urn i (and vice-versa), so that an exploited interaction will be favoured again in the future. To account for the expansion of the adjacent possible space, we also let two urns that interact for the first time to exchange a memory buffer, i.e., a subset of ν + 1 balls that each urn shares with the other. This set is selected using a specific rule s that we present below. Thanks to this exchange, an urn that experiences a novel connection expands its adjacent possible space, thus increasing the set of IDs that it may contact in the future.

Exploration strategies

The last ingredient of our model is the strategy s that an agent adopts when sharing its experience (the memory buffer) with nodes contacted for the first time. We introduce a total of six different strategies s to determine the ν + 1 IDs contained in the memory buffer being shared along with new links. Here, we report three of these strategies that turn out to best capture the phenomenology of the empirical datasets we consider, while we refer to the Supplementary Note 1 for the other strategies. The first strategy is the weighted sample with withdrawal (WSW) strategy: an agent draws ν + 1 distinct IDs from the urn proportionally to their abundance in the urn itself at that time, i.e., proportional to the number of the past interactions with each ID. Distinct means that the node extracts exactly ν + 1 IDs proportionally to their abundance and withdraws all the balls of an ID x after it has been drawn. This strategy corresponds to sharing the IDs that interacted the most with a node in the past and is the one applied in Fig. 3. Then, we define the symmetric sliding window (SSW): each agent keeps a buffer of its last ν + 1 distinct alters with whom he interacted in the past. These represent the set of IDs shared with a newly contacted agent. After the exchange, both agents update their memory buffer by pushing in the ID of the agent just contacted and removing the ν + 1st ID from their buffers. This strategy favours the spreading in the network of the recently activated connections, rather than the most frequent ones. Finally, we introduce the asymmetric sliding window (ASW): it is a variant of the previous strategy, the difference being that only the agent that initiated the interaction updates its memory buffer after the communication event. Given these definitions, we can now define the sequential evolution steps of our model.

Fig. 3: Three possible evolutionary steps of the model.
figure 3

Three possible steps of the Polya’s urn model for a system with equal reinforcement and innovation parameters ρ = ν = 1 with the weighted sampling with withdrawal sampling strategy s = WSW. For each evolutionary step of the system (columns), we show the current state of the urns (top row), the equivalent network evolution (mid-row), and, the sequence \({\mathcal{S}}\) of observed events (bottom row). In the network, we show already active links (solid lines), links in the adjacent possible space (dashed lines), currently active links (red lines) and connections entering into the adjacent possible space (blue dashed lines). The ball drawn is visualised with a red dashed circle while we show reinforcement balls and memory buffer ones enclosed in black and blue rectangles, respectively. New urns entering the system are shown with dashed borders. a At time t urn a is active and draws the ball b: the event (a, b) is then appended to the sequence \({\mathcal{S}}\). At time t + 1 the urn a then gains ρ copies of b and vice-versa (reinforcement) and, since the eab link is new, we also draw ν + 1 distinct balls from a following the WSW strategy (balls c and d within the dashed rectangle) that will be copied into b (and the same for b that sends e and f as novelties to a). In the network representation, the eab edge is active and the eae, eaf, ebc, and ebd links enter into the adjacent possible space. Notice that the adjacent possible of c changed without the need for c to participate in social interaction. b At time t, urn a draws a copy of b (top). Since the edge eab was already active in the past, we only put ρ copies of b in urn a and the other way around. The network’s topology does not change in this step, while the weight of the eab link gets increased (network representation). c Urn a draws a copy of c at time t (top). Since c is an empty urn, it creates ν + 1 novel IDs (g and h, in the dashed rectangle) and gains a copy of them. We add (a, c) to the sequence \({\mathcal{S}}\) and we perform the reinforcement/novelties exchange between a and c. The network gains two new nodes (g and h), activates a new edge (eac) and inserts new links in the adjacent possible space. The actual space of c acquires a while its adjacent possible space gains e, g, and h.

Model rules

The three parameters of the model are the reinforcement parameter ρ, the number of novelties to be shared ν and the memory buffer exchange rule s. A schematic representation of the model is given in Fig. 3 and we resume here the steps defining it (see the Supplementary Note 1 for details):

  1. (1)

    we start with two urns, a and b having a copy of each other’s ID inside of them; the urns also contain the ν + 1 distinct identities (IDs) of other urns that did not participate yet to any interaction (c, d for a and e, f for b). These sets represent their initial memory buffers. The sequence of events \({\mathcal{S}}\) is initially empty;

  2. (2)

    at each time step, we extract a “calling” urn i proportionally to the size of the urn Ui (the number of balls within the urn i). We then draw a ball from the calling urn i, say the ID j. This double extraction corresponds to a single event (i, j) that we append to the main sequence \({\mathcal{S}}\). In Fig. 3 the first event is the (a, b) one.

  3. (3)

    reinforcement: following the event (i, j), we add ρ copies of i in the j’s urn and ρ copies of j in the i’s urn. For example, in Fig. 3a, b, we add ρ copies of a in the b’s urn and ρ copies of b in the a’s urn.

  4. (4)

    novelty: if it is the first time that i and j interact, i and j exchange their memory buffer. With this mechanism, we add j’s memory buffer into Ui and, vice-versa, i’s memory buffer into Uj. In Fig. 3a, a’s memory buffer (c, d) is copied into Ub and b’s memory buffer (e, f) is copied into Ua. In this case, the memory buffer is determined using the s = WSW strategy, that is, by extracting exactly ν + 1 IDs from Ua proportionally to their abundance.

  5. (5)

    if a node j is called for the first time by another node (i.e., j is an empty urn so that Uj = 0), it creates ν + 1 new agents (empty urns) and, for each of them, it creates a ball into its urn: these ν + 1 IDs represent the initial memory buffer of j. In Fig. 3c node c creates two brand new nodes, g and h, that will represent its initial memory buffer. We note here that the newly created agents are initially empty urns so that they can participate in the dynamics (they can be included in the social network) only if another urn (agent) calls them. Only after this first call, they may actively engage in an interaction. In this scheme, an agent cannot join the network “from outside,” i.e., unless it is engaged by another agent already belonging to the network. Of course, the scheme can be generalised to account for other schemes for nodes to enter the system but we will not cover this detail in the present work.

Each evolution step is defined as a repetition of the 2 → 5 steps of the just outlined procedure, as shown in Fig. 3. The parameters ρ and ν weigh the relative importance of the reinforcement and exploration processes in the system. We define R = ρ/ν as the ratio between the two. The model is then entirely defined by three parameters only: the reinforcement value ρ, the ratio R = ρ/ν setting the relative importance of the reinforcement (exploit) and novelties (explore) mechanisms, and the strategy s used to exchange the memory buffer between nodes getting in contact for the first time.

Let us note that our model does not only represent a network growth model. Instead, it reproduces a sequence of interaction events that mimic the original network dynamics, i.e., the actual sequence of events. This feature is at variance with other models of network growth where each evolution step corresponds to the insertion of a new node establishing its connections following a given rule9,13. Moreover, the creation of new edges and the reinforcement of their weights stem from the exploration that each user does of the possible connections and the expansion of the number of potential acquaintances rather than from local rewiring schemes30,31. In the following, we show that, after fitting the model’s parameters to the three different networks we analyse, the model’s simple microscopic rules reproduce the main topological and dynamical features of empirical networks. Besides, the values of the three model’s parameters give some insights on the nature of the three social networks we are analysing.

Comparison between model and datasets

We now compare the main network features emerging from our modelling scheme’s evolution to their empirical counterpart. To compare the theoretical predictions with the observed data, we optimised the model by fixing the parameter values that minimise, for each empirical dataset, a cost function Sd(ρ, R, s). The latter evaluates the goodness of fit of the synthetic simulations to the empirical dataset d by looking at eight selected observables, both local and global, topological and dynamical (see the “Methods” for details). Specifically, we consider (i) the strengthening exponent β, (ii) the q exponent leading the average degree growth, (iii) the γ exponent driving the growth of the number of edges growth in time. We also take into account (iv) the asymptotic value of the average clustering coefficient c and the fraction of activated edges that are either old (already activated in the past) or new (being activated now) and that happen to insist on a triangle (closed) or not (open), thus defining four categories: the (v) old open (OO), (vi) the old closed (OC), (vii) the new open (NO) and (viii) the new closed (NC) active links, measuring the fraction of events falling in each category per time range in the asymptotic limit of the system evolution (see the “Methods” section). We summarise the results in radar plots in Fig. 4, showing the observed values for the eight selected observables along with their best numerical estimates. We observe that the model endowed with optimal values of the parameters is able to quantitatively reproduce all the selected observables in all the datasets, exception made for the APS dataset. In the latter, the model fails to predict the observables related to the network topology correctly. To explain this discrepancy, we note that the APS dataset is composed of cliques of events—rather than one single event between two IDs per time. This feature leads to a high clustering coefficient (as all the agents publishing one paper are fully connected) and in an increased count of events observed along old edges insisting on at least one closed triangle. To filter out this effect, we performed a sub-sampling of the data by drawing a single link among all the possible ones for each paper and re-computed the features of this sub-sampled dataset (see the “Methods” for details and the Supplementary Note 4 for other sampling strategies that give similar results). When analysing these results, the disagreement between the real system and the model results disappears (see Fig. 4a, b), revealing that the model can explain the underlying interaction processes also in this dataset.

Fig. 4: Eight selected observables for model’s optimisation.
figure 4

ad Radar plots comparing eight selected observables measured in empirical (blue lines) and synthetic data (red lines). In each panel, we report the reinforcement and novelties parameters ρ and ν and their ratio R as well as the optimal sampling strategy s, that is: (a) American physical society (APS) with the symmetric sliding window (SSW) strategy s, (b) the 1-link subsampled APS with the asymmetric sliding window (ASW) strategy, (c) the mobile phone network (MPN) fitted with the ASW strategy, and, (d) the Twitter mentions network (TMN) with the weighted sampling with withdrawal (WSW) strategy.

In the Supplementary Note 1, we give an analytical solution of the model and we show how each model parameter affects the main observables of the system. Here we report that as for the global observables discussed in “Results and discussion” and reported in Figs. 1 and  2, the relevant parameters are the ratio R = ρ/ν and the sharing strategy s, while the absolute values of ρ and ν impact the behaviour of the observables related to the local topology of the network. Despite the limited number of parameters, the model is flexible enough to reproduce a wide range of phenomenologies, from highly exploratory situations such as the APS dataset (high γ and low β exponent) to more exploitative scenarios, with a reduced number of connections being explored such as in the MPN (with a large β and a low γ exponent).

Global trends

Besides the eight observables included in the cost function Sd(ρ, R, s), in Fig. 1 we show that the model can also reproduce the empirical networks’ broad activity (panel a), the degree distributions (panel b), the growth of the number of edges in the network E(t) tγ (panel c), the functional form of the strengthening function pe(k) (panels d–f), as well as the sub-linear growth in of the average degree (panels g–i). Let us note that the average pe(k) curves presented in the insets of Fig. 1d–f feature different support in the rescaled degree xe = k/ce. In particular, the APS and TMN empirical data span a larger rescaled degree range with respect to the synthetic data, while the opposite is true in the MPN case. This difference is due to the discrepancy in the P(ci) distribution of the strengthening constant ci in the empirical populations and the synthetic ones, as we show in the inset of Fig. 5a. Indeed, our model reproduces larger (smaller) ci in the APS/TMN (MPN) cases compared to the empirical situation. This results in a smaller (larger) range of the rescaled degree (ke) as shown in Fig. 1d–f.

Fig. 5: Topological properties of social networks: empirical data and model predictions.
figure 5

In all the panels, we show the data of the empirical datasets using symbols and data from the synthetic simulations with solid lines. To distinguish the datasets, we use the red colour for American physical society (APS), cyan for the APS 1-link subsample dataset, green for the mobile phone network, and blue for the Twitter mentions network dataset. a The P(α) distribution of the local Heaps’ exponent α for a sample of 10% of nodes as measured in the empirical networks and the simulated ones. In the inset, we show the P(ci) distribution of the individuals’ strengthening constant ci as measured in empirical data (coloured boxes) and their corresponding simulations (black boxes). b The P(w) link weight distribution for the three empirical datasets (symbols) and the ones found in the artificial networks (solid lines). c The average overlap in the network obtained by removing edges in ascending order of weight up to a given cumulative density function value (percentile). d The average clustering coefficient c(fO) measured on the network obtained by removing edges being in the overlap percentiles O ≤ fO. e The distribution P(r) of the community sizes r found in the different datasets. In the legend, we show the network modularity m as measured in each dataset. f The distribution of the fraction of core nodes within the communities of the different networks. We compare the empirical results (blue distributions on the left) with the synthetic ones (orange areas on the right, the dataset names are reported on the x-axis).

In the insets of Fig. 1g–i, we also show that the model can reproduce the average degree growth in time \(\left\langle k({t}_{e},t)\right\rangle \propto {t}^{q}\) for the different node classes. The q exponent reproduced by the model is not always in perfect agreement with the empirical value (e.g., q = 0.56 in the TMN while q = 0.63 in the best fitting model), but in the Supplementary Fig. 13, we show that the distribution P(qe) of the exponent per node class is in very good agreement between the empirical and synthetic cases.

Furthermore, the model can reproduce the broad fluctuations in the rate of innovation (i.e., in the creation of new links), as measured by Taylor’s law (Fig. 2). In the Supplementary Note 3G we also show that the model qualitatively reproduces the inter-event distribution between two consecutive events of edges creation for a node as well as the entropy of such sequence. These results represent the first improvement of our model with respect to the state of the art as we do not require (i) to manually specify a broad fitness distribution of the nodes to reproduce the heterogeneous degree distribution as in ref. 9, (ii) to specify the probability to contact a new node instead of an already contacted one as in ref. 28, as the strengthening behaviour of pe(k) naturally emerges from the microscopic evolution rules, and (iii) to set the selection rule of the next active node i and the probability distribution of the alter j that he will contact, as they emerge from the model as opposed to refs. 26,34.

Heterogeneities in the experience of the new

Besides the properties already exposed, the model also correctly captures the heterogeneous propensity of individuals to establish new connections, i.e., the rate at which they experience novelties. To quantify this rate, we look at the exponent of the Heaps’ law describing the growth of the degree of an individual, ki(xi), i.e., the number of distinct people encountered as a function of the number of social events performed xi: \({k}_{i}({x}_{i})\propto {x}_{i}^{\alpha }\). Figure 5a reports the distribution of empirical exponents α for the three datasets considered. These distributions are peaked at different \({\bar{\alpha }}_{d}\) values for the different datasets (\({\bar{\alpha }}_{{\rm{APS}}} \sim 0.9\), while \({\bar{\alpha }}_{{\rm{TMN}}} \sim 0.7\) and \({\bar{\alpha }}_{{\rm{MPN}}} \sim 0.4\)). Remarkably, the model correctly reproduces both the peak value and the broadness of each empirical P(α) distribution.

Another empirical quantity heterogeneously distributed is the strengthening constants ci, setting the probability for an individual with k connections to acquire a new one as \({p}_{i}(k)={(1+k/{c}_{i})}^{-\beta }\). The inset of Fig. 5a shows that our model qualitatively reproduces the empirical distribution P(ci) of the strengthening constants ci, even tough with some discrepancies between the empirical and synthetic data. In particular, the variance of the synthetic values is limited by the strict evolution rules of the urn, that cannot indefinitely diverge from the average behaviour. That is why all the three best-fitting models feature a similar P(ci) distribution (all peaked at around ci ~ 1) whereas the empirical cases show smaller ci in TMN and APS, while larger ci are observed in the MPN case. A possible extension of the model should then try to address this drawback. However, the model automatically reproduces a distributed propensity of individuals to decrease their social exploration at a given cumulative k. This is at variance with previous works that explicitly encoded this rule in the model definition28,35.

Topological correlations

We now focus on the topological correlations of the empirical and synthetic networks of interactions. In Fig. 5b we show that the model correctly reproduces the overall link weight distribution P(wij), i.e., the distribution of the number of activations of a single edge wij. Then, we test that both the empirical and the synthetic data obey the weak and strong ties scheme of the Granovetter conjecture2,12. The latter states that links in a social system will be arranged to have communities of individuals tightly connected by strong ties and with a large neighbours overlap. These communities are then interacting through weak ties, i.e., links acting as bridges between communities composed by nodes sharing a limited number of common neighbours (low overlap). To prove this, we measure how the overlap Oij of two nodes (i.e., the fraction of common neighbours of nodes i and j with respect to their total number of neighbours) correlates with the weight wij of the edge between i and j. In Fig. 5c we show how the average edges’ overlap varies as we filter out network’s edges with a weight smaller than a given percentile w. In all the cases we observe a positive correlation between the two quantities at high values of w, the only exception being the APS case (red symbols). As for the previous clustering analysis, we show that once we subsample the APS events, the empirical data agree with numerical simulations (cyan symbols and line). Also, in this case, our model is able to reproduce a non-trivial arrangement of the edges’ weights and their topology. In the Supplementary Note 3H we compare other topological measures finding a good agreement between the empirical and synthetic cases (positive degree assortativity and negative local clustering-degree correlation).

We further test if our model reproduces the community structure (modularity) of the empirical networks. In Fig. 5d, we inspect how the average clustering coefficient c(fO) of the network varies when removing the edges eij by their ascending overlap, i.e., by discarding edges with overlap Oij ≤ fO, being fO the cumulative percentile of the overlap distribution. We find c(fO) to increase as one removes edges with small overlap, indicating that the removal of weak ties is removing bridges between communities. Then, the c(fO) peaks and, if we keep removing the higher overlap edges, we start breaking the triangles in the communities’ cores, the clustering coefficient decreases. Again, the disagreement between the empirical and synthetic APS dataset disappear if we consider the 1-link sub-sampled dataset. As shown in Fig. 5d, the subsampled dataset perfectly matches the corresponding numerical data.

Then, in Fig. 5e we show, for each dataset, the P(r) distribution of the community size r as found in the empirical networks (symbols) and the numerical simulations, together with the modularity values of the networks (see the “Methods” for details). Surprisingly enough, the model can reproduce the communities’ size distribution and the modularity values for the APS (both original and sub-sampled) and the TMN datasets. In the MPN dataset, the synthetic network’s modularity is found to be smaller than the empirical one. This finding can be due to the limited size that the network reaches in the simulation time (~2500 nodes versus the millions of the real case) and the fraction of time steps that we can simulate compared to the real case (107 simulated steps versus the 2  109 real events). Nevertheless, the overall shape of the communities size distribution is also reproduced in this case.

We then check if the generated networks also reproduce the empirical core-periphery structure. Intuitively, a core-periphery structure corresponds to a network with communities composed by a densely connected set of nodes (the core) to which other nodes sparsely connected amongst them are linked (the periphery, see Methods for details). We report the results in Fig. 5f where we show the distribution of the fraction of core nodes detected for each community. As one can see, the model correctly reproduces even the proportion of core and periphery nodes in communities.

Finally, passing to the microscopic dynamics, we report the temporal evolution of the number of events allocated by nodes toward new or old links insisting or not on triangles (the categories OO, OC, NO, and NC defined above and in the “Methods”). Again, we note that, while the original dataset of APS shows a behaviour not well reproduced by the model, the latter very nicely predicts the behaviour featured by the 1-link sub-sampled APS dataset, as well as the behaviour featured by the other two datasets (Fig. 6).

Fig. 6: Dynamics of links formation: empirical data and model predictions.
figure 6

The plot of the empirical data and model results for the fraction of activated links that are: old (i.e., they have been active before) and open (they close no triangle) OO(t) (blue markers), old and close (i.e., they close at least one triangle) OC(t) (orange markers), new (i.e., they get activated for the first time) and opens NO(t) (green markers), and new and close NC(t) (red markers). In each case, we show the temporal behaviour (markers), and the asymptotic value (dashed lines, the asymptotic value is reported in the legends). Panels refer to (a) American physical society (APS) and (b) its model, (c) 1-link subsampling of APS and (d) the synthetic model, (e) the mobile phone network (MPN) dataset and (f) it is corresponding numerical one, and (g) the Twitter mentions network TMN dataset with (h) the synthetic one.

These additional results overcome some of the limitations of previous models and approaches. For example, in34 the community structure has to be fixed from the beginning of the simulation, whereas other models where the community structure emerges from the network growth do not account for either the edges weights9,13,33, the network dynamics36 or the topological correlations of weights and overlap35. Moreover, none of the network dynamic models we are aware of can reproduce this large set of dynamical and topological observables at once. Either they focus on the link weights allocation28, on the activation patterns of nodes35 or they do not account for global observables such as community structures and modularity at all36.

Optimal exploration strategies

Let us finally note that the optimal parameters values found for each dataset draws some meaningful insights on the microscopic mechanisms driving the exploration of the social space in the different context of each social network. In the TMN case, we find R = 1, so that the reinforcement and the novelty exchange processes equally influence the single agents’ exploration process: this is reasonable in a system where new connections require little effort from the user. Moreover, the strategy s = WSW (weighted sampling with withdrawal) with ν = 5 is the one that better describes the empirical data: users select new accounts to mention by sampling from the past interactions of the alters they are connecting with proportionally to the number of their past interactions.

On the other hand, in the MPN case, the best fit is obtained for R = 3 and ρ = 21. The system dynamic is dominated by reinforcement processes that tend to reinforce links that are first established and inhibiting the creation of new edges. In this case, the best fit with the s = ASW memory buffer sampling strategy highlights that individuals share their last ν + 1 = 8 contacts, thus spreading copies of recently contacted IDs rather than the most contacted ones. Notice that the last contacted ν + 1 IDs may, in general, be different from the most representative IDs within the urn. The asymmetric nature of the ASW strategy indicates that users actively exploring new connections update their memory buffer, whereas nodes passively participating in communication tend to conserve their previous memory buffers.

Finally, in the APS case, we find an extremely exploratory dynamics characterised by a relatively low R = 0.4, i.e., a relatively high ν. This finding is symptomatic of a dynamics where the exploration of the social space overtakes the reinforcement of existing connections. A possible explanation lies in a large number of students and researchers authoring a few papers before quitting academia, providing a constant influx of new potential connections to be explored by senior researchers. The SSW optimal sampling strategy reveals that authors tend to share their last ν + 1  16 people they have been collaborating with, implying a preference to recommend recently active connections to new collaborators. Moreover, this strategy also catches the intrinsic symmetric nature of the co-authorship interaction, as both co-authors update their buffers of potential new collaborators. Further confirmation of this is that the sub-sample of the APS dataset is best fitted by a comparable R = 0.33 but with an ASW strategy s = ASW, where only the node actively engaging the interaction updates its memory buffer.

Conclusion

In this work, we proposed a theoretical model of social exploration to explain the birth and evolution of social networks. The theory is based on the adjacent possible framework and builds on a recently introduced mathematical formalisation of its conditional expansion. In this framework, the creation of new social bonds is the outcome of an exploration process unfolding on the space of possible new acquaintances, whose boundaries change while people explore them.

Without relying on unnecessary assumptions, our theoretical model builds on the adjacent possible space expansion and microscopic evolution rules that let emerge both microscopic and macroscopic features of real-world social networks. We compared the predictions with the empirical data from three diverse social networks: the network of mentions between Twitter users, the network of co-authorship of the APS, and a mobile phone-call network. The agreement between theory and data is surprisingly good. On the macroscopic side, the model reproduces the main static and dynamic features of those social networks: the broad distribution of degree and activities, the average clustering coefficient, and the innovation rate at the global and local levels. At the microscopic level, the most striking feature captured is the probability for an individual, with already k connections in its local network, to acquire a new acquaintance. The model also captures the topological correlations, the modular structure and the core-periphery organisation of nodes. Besides, the model is able to grasp the temporal-evolution of real-world systems at very different scales, from the local exploit/explore mechanisms of single agents to the global organisation of the network in communities of coherent users. To the best of our knowledge, this is the first model that reproduces these features without superimposing a heterogeneous fitness distribution to nodes, a specific rule to choose whether to interact with a new or old alter or to set the modularity of the initial network manually. Moreover, it is the first attempt to reproduce both the network growth (in terms of nodes and edges) and its dynamics (the sequence of activation events of the edges in time) at once, giving results in excellent agreement with the empirical case.

Besides being able to capture very complex features of social networks quantitatively, our theory also allows us to deepen our understanding of the microscopic mechanisms shaping the propensity of people to reinforce old contacts or establish new ones. For instance, in the Twitter mentions network, we find the exploration and reinforcement processes to be of equal importance. Moreover, when getting in contact with new alters, users share a sample of their most common contacts as new potential connections. On the other end of the spectrum, in the mobile phone-calls network, people reinforce their existing bonds more than they explore new ones. When suggesting new potential contacts to others, people tend to exchange their most recent contacts, rather than their most common ones. Finally, the network of scientific co-authorship of the APS journals features the most exploratory dynamics, with new connections massively expanding the adjacent possible space of a single node. In this case, people preferably share their last contacts, and the optimal synthetic update procedure is symmetrical, correctly reproducing the intrinsic symmetric nature of the interactions.

The theoretical framework proposed here is, of course, open to possible improvements. First, the simulated dynamics describes the evolution of a system from its outset. The initial conditions set here could be far from those of the real-world systems considered. Despite the excellent agreement with empirical data, a more comprehensive study on the dependence of the system evolution on the initial state is in order. Other generalisations could concern the possibility to remove links52 or to generalise the model to have a separate rate of nodes and links entrance in the system. Finally, our modelling scheme does not account for effects connected to semantics or affinity between people. For instance, it seems reasonable to assume that people create bonds and interact based on shared interests or their level of homophily53,54. The generality of the approach presented here will make the extension of the theoretical framework desirable and possible along these lines. Also, we restricted to the case when nodes exchange their memory buffer only on their first encounter, whereas other strategies may be considered. Another important extension is to test the predictive potential of the model fitting it on a subset of the points and test its forecasting ability to determine which nodes will be in contact (and with which strength) in the future.

We believe that the presented framework, together with its predictions validated on real-world social networks, represents a valuable step toward understanding the processes underlying the birth and evolution of social networks. It further creates an important bridge between network theory and urns models, opening the way to constructive contamination between the two fields and full exploitation of results derived for stochastic processes relevant in innovation dynamics50,51,55. This development, in turn, unlocks the possibility to grasp the very essence of social interactions and allows for the design of efficient and informed policies to address crucial challenges dealing with collective processes ongoing in social networks, such as the spread of diseases and online misinformation.

Methods

Data and code

Here, we summarise the empirical data used to test the predictions of our theory. These are three different real-world social networks: (i) the APS co-authorship network generated by all the papers published in all the APS journals from January 1970 to December 2006. (ii) TMN logging all the mentions between users recorded between January and September 2008. (iii) MPN recording the calls between users of a national provider in an undisclosed European country between January and July 2008. We refer also to the Supplementary Note 2 for details. These datasets represent diverse contexts of social interactions, making them an ideal set of empirical observations to test the universality of our model. In particular, the APS dataset describes the undirected interactions of co-authors of scientific papers56,57,58,59. Here interactions have a high cost in terms of time and resources. The TMN dataset reports the directed citations of a user i citing a user j (that corresponds to an edge from i to j) between users of the micro-blogging platform, in which interactions are requiring few resources and can be virtually established from and to any node in the network60. Finally, the MPN dataset lies somewhere in between: communication is not as cheap as in the TMN but still easier than in the APS case8. Also, the network may be not single scoped for the users taking part in it: some of them may use it to call close contacts whereas others may use the phone for business reasons17,25. Let us also note that the TMN and APS datasets account for the growth of the two systems since their onset. Indeed, the effective onset of user adoption for Twitter occurred during 200861, whereas the APS created the majority of its journals in 1970. This circumstance ensures a unique testbed for a model of network growth. On the other hand, the MPN situation is more subtle as we have only a limited observation window on a system that underwent a long evolution period beforehand. Summarising, the three datasets used in the study are:

  • The co-authorship networks found in the Journals of the APS59 covering the period between January 1970 and December 2006 and containing 301,236 papers written by 184,583 authors that are connected by 995,904 edges.

  • TMN, containing all the mention events exchanged by users from January to September 2008. The network has 536,210 nodes performing about 160 M events and connected by 2.6 M edges;

  • MPN composed of 6,779,063 users of a single operator with about 20% market share in an undisclosed European country from January to July 2008. The datasets contain all the phone calls to and from company users, thus including the calls towards or from 33,160,589 users in the country connected by 92,784,825 edges.

  • The synthetic simulations have been run for T = 106 evolution steps for configurations with R ≤ 1, T = 5  107 otherwise. We performed ten independent simulations for each set of parameters. We define as the best set of parameters the set featuring the lowest average cost function on those ten replicas. In the figures, we report the figures and behaviour of the run that gave the minimum cost function among these ten runs.

The code used to run the simulations, all the analysis code, as well as the synthetic data analysed, are available in62. Due to data policies and IPR, we cannot share the MPN data, while the APS data are from the work in59 and the TMN data are made available on figshare (see link in the Data availability section) . Finally, let us note that we used these three datasets despite the wealth of longitudinal network datasets present in the literature because we needed the complete sequence of events to measure and fit all the dynamical observables.

Asymptotic behaviour of the system

In this work, we leverage on a previous analysis performed on the same datasets as found in25. Specifically, we measure the strengthening probability pi(k), i.e., the probability for an individual i who already contacted k distinct individuals in the past to contact a new one (i.e., a new node of the network). To average this probability on homogeneous classes of people, we divide the nodes in e = 1, …, E classes depending on their time of entrance in the system, te, and their final degree ke, one class for each combination of te and ke. The functional form of the probability pe(k) is found to depend on class e of the nodes as \({p}_{e}(k)={(1+k/{c}_{e})}^{-\beta }\) with a single overall β exponent and a distributed reinforcement constant ce. The latter is fixed by computing, for each k the ratio between the number of events resulting in a degree increase from k to k + 1 performed by the nodes belonging to the class e and the total number of events performed by the same set of nodes at degree k (thus including the ones toward already contacted nodes). As for the growth of the average degree \(\left\langle k({t}_{e},t)\right\rangle\), we measure the average degree at time t > te for all the nodes belonging to the class with entrance time te. In this way, we are defining a new set of classes only defined in terms of the entrance time te. Asymptotic behaviour is found to be \(\left\langle k({t}_{e},t)\right\rangle \propto {t}^{q}\).

Model cost function

We ran the model at different values of R and ρ for each one of the six sample strategies s (see Supplementary Note 3 for details). For each dataset d we select the configuration that best fit the data by minimising the cost Sd(ρ, R, s) that reads

$${S}^{d}(\rho ,R,s)=\mathop{\sum }\limits_{i=1}^{8}\frac{| {o}_{i}^{d}-\tilde{{o}_{i}}(\rho ,R,s)| }{{\sigma }_{i}^{d}},$$

where \({o}_{i}^{d}\) and \({\sigma }_{i}^{d}\) are the value and uncertainty on the ith observable of the empirical dataset and \(\tilde{{o}_{i}}(\rho ,R,s)\) is the value of the same observable measured in the simulations with configuration (ρ, R, s). The eight selected observables are: (1) the exponent γ leading the growth of the number of edges E(t) tγ, (2) the optimal β measured in the strengthening function p(k), (3) the average clustering coefficient c, (4) the exponent leading the growth of the average degree per node class \(\left\langle k(e,t)\right\rangle \propto {t}^{q}\), (4)–(8) the fractions OO, OC, NO, NC of events allocated toward old/new link insisting or not on an open/closed triangle.

APS subsampling

In the APS dataset, we transform each paper published by n authors in a sequence of E = n(n − 1) events with all the possible links between all the ordered couples of co-authors. We then sample l links over the E possible links for each paper to be inserted in the total sequence \({\mathcal{S}}\). The results reported in the main text refer to l = 1, and the reader can refer to the Supplementary Note 4 for results with different values of l and different strategies of subsampling (number of sampled links proportional to E).

Modularity and core-periphery measures

Communities and modularity values have been found using the Infomap algorithms in the Python iGraph module63, using the edges weights wij as the weight parameters and with ten trials. For computational reasons, we restricted ourselves to a sub-graph induced by a sub-sample of 100,000 nodes in all the analysed cases (both empirical and synthetic). These sub-samples have been determined using the sampling method found in64 in its deterministic version. Indeed, this method returns a set of nodes whose subnetwork best reproduces the modularity, clustering and overlap features of nodes in a network. In the core-periphery analysis, the algorithms find another set of communities as this algorithm uses a configuration model as its null model. However, we show in the Supplementary Note 5 that the P(r) distribution of community sizes is consistent with the one found by the Infomap algorithm. For the core-periphery analysis, we run the model introduced in65 with its C++ implementation found in66 against the same sub-graphs of the previous analysis. Also in this case we use the edges’ weights wij as the weight passed to the algorithm.

Events on new-old and open-closed edges

We count, for each logarithmically spaced time interval, the number of events happening on edges that are either old (already activated in the past), new (being activated now) and that happen to close a triangle (closed) or not (open). These four categories are then: the OO, OC, NO, and NC that we define as the fraction of events falling in each category per time range in the asymptotic limit of the system evolution—i.e., after 60% of the events passed.