Introduction

A legal entity is a juridic term that designates an individual, company, or organization that has legal rights and obligations. In the standard terminology, legal entities are usually divided into individuals or corporations (e.g companies). The economic entity principle, stating that financial transactions must be assigned to a specific business, is considered as one of the fundamental principles of accounting: each entity must have separate accounting records, except for its subsidiaries. This principle implies that the nature and clarity of the information describing legal entities is important for compliance standards. Several types of interactions can exist between these entities, for instance payments or capitalistic property, defining different network structures. Such networks have already been studied, namely the interbank market (Boss et al. 2004) and interbank payment flows (Soramäki et al. 2007). A study on systemic risk in the interbank network, where payment interactions are treated as a complex network was described in Lenzu and Tedeschi (2012). Recent work attempted to describe the actual topologies observed in the financial system (Inaoka et al. 2004).

In this work, we propose a methodology built on recent developments in network science, to analyze a worldwide capitalistic ownership graph, and provide results on industrial data (described in Dijk (2018)). After an initial analysis using centrality measures, we propose an influence maximization (IM) algorithm in order to target specific nodes that have high importance in the ownership graph. This work significantly extends an earlier and preliminary analysis in Khalife et al. (2019). Specifically, this article contributes:

  • a development and application of influence maximization explanation to the study of capitalistic graphs

  • an extensive empirical evaluation comparing recent methodologies and techniques from network science to this area

  • extraction and analysis of patterns relating to capitalistic ownership

  • an in-depth discussion of the results, and

  • a number of recommendations for future work.

Ownership network and influence maximization

Network model

We denote a capitalistic graph \(G=(V,E)\). A vertex \(i \in V\) represents a legal entity, an oriented edge \(e_{ij}\) with weight \(w_{ij}\) represents capitalistic property: an edge from i to j means that i owns capital of j in proportion \(w_{ij}\). \(\mathcal {N}^{+}_i\) and \(\mathcal {N}^{-}_i\) are respectively the out-neighbors and in-neighbors of vertex i (i.e., set of nodes connected to and from the vertex). |I| is the cardinality of a finite set I. For mathematical convenience, we impose the following constraints:

$$\begin{aligned} {\begin{matrix} &{}\forall e \in E, \, w_e \in [0,1]\\ &{}\forall i \in V,\, \sum _{j \in \mathcal {N}^{-}_i} w_{ji} \le 1 \end{matrix}} \end{aligned}$$
(1)

In Sect. 3.1, we describe how to ensure Eq. (1). Following standard juridic terms, there are two types of entities in the graph:

  • Natural person: individual human being

  • Legal person: incorporated organizations including corporations, government agencies; or non-governmental organizations. A legal person is composed of natural persons, but has a distinct juridic identity.

We suppose there does not exist inner links between the subgraph of natural people, even though natural person can have influence over others (we suppose this link is not explicit so it is not considered as an edge in the graph). We are interested in studying the influence of legal entities, so the distinction between natural entities and organizations raises an important discussion for the evaluation of such influence.

In view of the instance at our disposal, and in particular the fact that the data is poor with regard to personal attributes, we decided to consider the first situation, that is to say to keep only the links between organizations. Furthermore, unlike market payment graphs which are very dynamic, the capitalistic graph evolves across a relatively long time scale and significant changes do not occur frequently (except rare events, such as the onset of an economic crisis). Therefore, we consider the capitalistic graph constant for our analysis and do not analyze its dynamic aspect, although this may be considered in future work. The data is provided by Bureau Van Dijk (2018), and lists all the physical and legal entities. Each of these entities has corresponding metadata: Name, location (country and continent), and description. They also have a weighted capital property over a list of other entities. The original data format is in a non-relational form; software development was necessary to parse and load data into a graph. For reasons of confidentiality, the data has been anonymized. Supplementary description is in Sect. 3.

Previous work

Location-sector analysis of the Orbis network has been proposed to identify and locate important entities (Nakamoto et al. 2019) and, in a 2015 snapshot, to analyze the location and sector of conduit firms likely to be used for treaty shopping (Nakamoto et al. 2019). In Heemskerk and Takes (2016) take a data-drive approach of community detection through modularity maximisation. In this work, we consider a different methodology in order to measure the capacity of some entities to influence economically the network. Differently, we here model an entity and its subsidiaries as separate, making the assumption that the influence of an entity is also spread through its subsidiaries in the diffusion. The macroeconomic indicators aforementioned will then be revealed using adequate centrality measures and influence maximization algorithms. Finally, in comparison with those references, we use a more recent version of the network (Orbis 2018). We defer a more in-depth discussion of the wider financial-analysis literature to Sect. 4.

Motivation and justification of the methodology

The notion of influence we consider concerns the capacity of an entity to influence economically and socially on the other entities of the network, and will be defined at different scales in our study. Firstly, at the level of legal entities, through the possession of capital, and then, using aggregation of attributes detailed in Sect. 2.7, at country and sector levels.

The motivations to analyse the knowledge graph under the prism of centrality measures and entity influence is two-fold. First, the detection of influential communities of entities has a natural interpretation in economic and financial terms. Second, the existence of low time complexity algorithms available for this purpose. Indeed, the considered network being very large, any non (quasi-) linear time complexity algorithm would result in computational time issue. Besides, several possibilities exist in order to model the influence of nodes, for instance with the influence interdiction problem (Omer and Mucherino 2020). However to the best of our knowledge, other formulations do not have linear time algorithm to obtain good approximate solutions efficiently. On the contrary, centrality measures (k-cores) and some influence maximization algorithms, which are relatively straightforward methods, have linear time complexity have proven to be very efficient in the detection of influential communities. In the remainder of this section we present the main components of our analysis in order to describe the topology of the graph, and to measure influence of entities, specifically:

  • Degree distribution and connected components

  • Centrality measures and K-cores

  • Rooted influence graph

  • Aggregation by attribute

  • Influence maximization

Later, in Sect. 3, we provide the results of the analysis, with respect to location and sector.

Degree distribution and connected components

In the following A is the adjacency matrix of the network of the graph \(G=(V, E)\).

$$\begin{aligned} A_{ij}= {\left\{ \begin{array}{ll} w_{ij} &{} \text {if }e=(i,j) \in E\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

The inner-degree and outer-degree of a vertex i are respectively the sum of coefficients in column i and row i. The degree distribution is a fundamental and important measure in complex networks (especially Internet Siganos et al. 2003), because it allows to classify graphs according to the type of distribution (e.g power law graphs). In the following, we will consider degrees of the non-weighted graph, so that the in and out degrees of vertex i are respectively the number of in and out neighbors (i.e number of non-zero coefficients on the column and row i of A). A component \(W \subseteq V\) of G is said connected if for each i and j in W, there exists a path from i to j.

Degeneracy and centrality measures

A graph is said to be k-degenerate if and only if every subgraph has a vertex of degree at most k. The degeneracy of a graph is defined as the smallest value of k for which it is k-degenerate. As such, it is a measure of sparsity. In order to enrich this notion, let us define the notion of k-core for an undirected graph. A k-core of an undirected graph is defined as the maximal subgraph \(C_k \subset G\) such that each node has degree at least k in this subgraph:

$$\begin{aligned} C_k = \{ i \in V | \,\,\, | C_k \cap \mathcal {N}_i | \ge k\} \end{aligned}$$
(3)

If \(C_k\) is not empty, then G has degeneracy at least k since there exists a subgraph that contains nodes of degree k. Therefore, the degeneracy of G is the largest k for which \(C_k\) is not empty. The core number of a vertex is the maximum k for which he belongs to \(C_k\). The notion of k-core extends to directed graphs, with in and out cores (i.e., \(\mathcal {N}_i\) becomes \(\mathcal {N}^{-}_i\) and \(\mathcal {N}^{+}_i\) respectively). There exist linear-time algorithms to compute k-cores (Batagelj and Zaversnik 2003), and memory efficient methods (Cheng et al. 2011). As presented in Giatsidis et al. (2013), it is possible to extend the definition for directed graphs, considering 2-dimensions version (\(\mathcal {D}\)-cores), but they are not considered in this work. In the scope of this study, the computation of k-cores will yield dense communities of entities having important intra-ownership when k is close to the degeneracy of the graph.

Rooted influence graph

For each entity, we define a subgraph, the rooted influence graph (RIG), similarly to the rooted citation graph (RCG) in collaboration analysis (Giatsidis et al. 2019). The RIG of a vertex i is the subgraph of G induced by the set of vertices that contain i and all the vertices which can be reached by a directed path. That is, \(j \in\) RIG if and only if there is a directed path from vertex i to vertex j. The resulting directed acyclic graph (DAG) contains all the entities that are directly or implicitly influenced by the entity i.

Based on this definition, it is natural to consider the following quantities in the RIG to measure influence of an entity:

  • (a) out-degree

  • (b) average degree of the nodes in the RIG

  • (c) core influence (i.e core number of the considered entity of the undirected RIG)

In Sect. 3, we use the core influence measure since we are interested in the influence of communities rather than individuals. Indeed, coreness in the RIG of an entity is a measurement of how dense is its neighborhood.

Aggregation by attributes

The ownership graph also contain entity attributes (location (country or region) and a description of the activity (sector), cf Sect. 3.1 for a precise description). Here, we present a method to analyze the graph of entities by attributes. Let \(\mathcal {A}\) be the set of values for a given attribute of the entities. For \((a,b) \in \mathcal {A}^{2}\), let \(G_a\) (resp. \(G_b\)) be the set of entities having attribute a (resp. b). We define a new graph \(G_{\mathcal {A}} = (\mathcal {A}, E_\mathcal {A})\) between attribute values in the following way:

$$\begin{aligned} w_{ab} = \frac{1}{|G_b|}\sum _{j\in G_b} \sum _{i \in G_a \cap \mathcal {N}^{-}(j)} w_{ij} \end{aligned}$$
(4)

In other words, \(G_\mathcal {A}\) provides a kind of “meta-graph” based on the pairwise relationship between the values of \(\mathcal {A}\). For example, a graph where \(\mathcal {A} = \{a,b\}\) defines two countries, will be a graph of two nodes, inheriting the connectivity in the form of an aggregation \(w_{ab}\) on up to two directed edges \(E_\mathcal {A}\). This definition of Eq. (4) insures the following mathematical conveniences, as also in Eq. (1):

\(\forall (a,b) \in \mathcal {A}^{2}\)

$$\begin{aligned} 0&\le w_{ab} \end{aligned}$$
(5)
$$\begin{aligned} w_{ab}&\le \frac{1}{|G_b|}\sum _{j\in G_b} \sum _{i \in G_a \cap \mathcal {N}^{-}(j)} w_{ij} \nonumber \\ w_{ab}&\le 1 \end{aligned}$$
(6)

and

$$\begin{aligned} \sum _{a \in \mathcal {A}}w_{ab}&= \frac{1}{|G_b|}\sum _{j\in G_b} \sum _{a \in \mathcal {A}} \sum _{i \in G_a \cap \mathcal {N}^{-}(j)} w_{ij} \nonumber \\&= \frac{1}{|G_b|}\sum _{j\in G_b} \sum _{i \in \mathcal {N}^{-}(j)} w_{ij} \nonumber \\&\le \frac{1}{|G_b|}\sum _{j\in G_b} 1 \nonumber \\ \sum _{a \in \mathcal {A}}w_{ab}&\le 1 \end{aligned}$$
(7)

Moreover, the choice of Eq. (4) is a good candidate for the influence of a attribute a over an attribute b (e.g influence of Germany over France), since it corresponds to a quantity representing the percentage or total capital owned by a over b. Therefore, The meta-graph \(G_\mathcal {A}\) allows us to analyze interactions between attributes (i.e countries or sectors). However the limitation of this analysis lies in the assumption that entities have the same capital: we will come back to this limitation in Sect. 3.

Diffusion models and influence maximization

Influence maximization (IM) in a network is the problem of maximizing influence with regards to seed nodes using a diffusion model. It has been extensively studied recently due to its potential commercial value. An example of application of influence maximization is viral marketing (Domingos and Richardson 2001), where an organization wants to spread the adoption of a product from selected adopters. Influence maximization is also the corner stone in other important applications such as network monitoring, rumor control, and social recommendation. The first question is how to model the information diffusion process in a network, which affects the influence spread given a seedset.

In the previous section, we have defined the nature of the complex network we consider. In the following, the object of the spreading in this network is considered of factor of economic ownership which has different implications on economic and social influence. There exist several models for diffusion (Li et al. 2018). The first category of these models are called progressive. Activated nodes cannot be deactivated in later steps. Most IM algorithms consider the progressive models, and we will limit the discussion our study to those. In the next two paragraphs, we present two standard progressive models.

incoming neighbors are active. Each vertex v is associated a threshold \(\tau _v\), and gets activated if and only if the sum of incoming neighbors (incoming weights if the graph is weighted) is higher than \(\tau _v\).

The Independent Cascade (IC) model Independent Cascade (IC) is a classic and well-studied diffusion model (Goldenberg et al. 2001). A vertex v is activated by each of its incoming neighbors independently by introducing an influence probability to each edge. Given a seed set S at time step 0, the IC diffusion unfolds in discrete steps. Each active user u in step t will activate each of its outgoing neighbor v that is inactive in step \(t+1\) with probability \(p_{uv}\). The activation process can be considered as flipping a coin with head probability \(p_{uv}\): if the result is head, then v is activated; otherwise, v stays inactive. u has one chance to activate its outgoing neighbors. Then, u stays active and stops the activation. The diffusion instance terminates when no more nodes can be activated. The influence maximization (IM) problem is theoretically complex in general. More precisely, it has been proven that there is no polynomial algorithm solving IM unless \(P=NP\) under most of the diffusion models (Li et al. 2018). This theoretical result implies that it is very challenging to retrieve a solution close to the optimal seed set, and to scale to large graphs simultaneously. A naive algorithm exploring all the possible values for the seed sets of size k in a graph with n vertices yields a \(\left( {\begin{array}{c}n\\ k\end{array}}\right)\) time-complexity (supposing the diffusion is in O(1)).

Influence function The definition of an influence function depends on the fundamental assumption that the diffusion model D terminates given any set of nodes S as input. In these conditions, \(\sigma _{D,G}\) represents the function counting the number of vertices affected by the spread at the end of the process using the graph G. It operates on the set of subsets of V having k vertices. Given a diffusion model D, a graph G and a seed set size k, the IM problem corresponds to the following optimization problem:

$$\begin{aligned} S_{D,G}^{*} = {{\,\mathrm{arg\,max}\,}}_{|S| = k}{\sigma _{D,G}(S)} \end{aligned}$$
(8)

DG are usually are implicit, so that \(\sigma _{D,G}\) usually becomes \(\sigma\). There exists a literature dedicated to IM (for a survey, see for instance Li et al. (2018)). This work is not intended as a survey on the topic, so we report some results and methods that we considered in order to measure entity influence in the capitalistic graph.

Greedy framework The greedy algorithm consists in adding iteratively nodes to S (starting from \(\emptyset\)), selecting a vertex u if u provides the maximum marginal gain to the influence function \(\sigma _D\). The algorithm terminates when \(S=k\). The theoretical guarantee of the greedy framework depends on the properties of \(\sigma\) (so implicitly on the diffusion model D and G). In the diffusion model we are considering (LT and IC), \(\sigma _{D}\) is a non-negative monotone and sub-modular function (for a proof, cf. for instance Mossel and Roch (2010)), but this property does not depend on the graph G considered. In these conditions, it is possible to estimate the approximation ratio of the greedy framework. If \(S^{1}\) is the set of vertices returned by Algorithm 1, then Nemhauser et al. (1978):

$$\begin{aligned} \sigma _{D,G}(S^{1}) \ge (1-(1-\frac{1}{k})^k)\sigma _{D,G}(S^*) \end{aligned}$$
(9)
figure a

Sketch-based methods There is a plethora of methods based on Monte-Carlo simulations in order to obtain approximate solutions of the IM problem. Sketch-based approaches are a range of methods that try to conceal theoretical efficiency (not only practical efficiency) and preserving a constant approximation ratio. The main disadvantage of these methods is that they do not generalize to all diffusion models. However, they are compatible with LT and IC (using a generalization through the triggering model Kempe et al. (2003)). Specifically, Borgs et al. (2014) avoid the limitation of Greedy and propose a drastically different method for influence maximization under the IC model, called Reverse Influence Sampling (RIS). We first introduce two concepts:

Reverse reachable (RR) set Let v be a vertex in G, and g be a sub-graph of G obtained by removing each edge e in G with \(1-w_e\) probability (\(w_e \in [0,1]\)). The reverse reachable (RR) set for v in g is the set of nodes in g that can reach v. (That is, for each node u in the RR set, there is a directed path from u to v in g).

Random (RR) set Let G be the distribution of g induced by the randomness in edge removals from G. A random RR set is an RR set generated on an instance of g randomly sampled from G, for a node selected uniformly at random from g.

Sketch-based methods consider \(\theta\) sketches of the graph: \(\{G_1, \ldots , G_\theta \}\), and \(\sigma _{D,G}(S)\) is computed as the average number of users reached by S on these sketches: \(\hat{\sigma }_{D,G}(S) = \frac{1}{\theta } \sum _{i=1}^{\theta }\sigma _{D,G_i}(S)\). Borgs et al. (2014) showed that it is not necessary to estimate the influence using sketches on the entire graph.

figure b

The average infected seed set over the sketches using LT and IC can be proven to converge in probability to the infected set of the all graph (Kempe et al. 2003). Moreover, the quality of the solutions returned by some of the sketch-based methods (e.g IMM Tang et al. (2015)) have been proven to hold an approximation ratio of \((1-\frac{1}{e}-\epsilon )\) with high probability (Tang et al. 2014).

Results

Description and preprocessing

Orbis is a database composed of about one hundred million entities, developed by a specialized group based in Bureau Van Dijk’s Brussels office, aggregating several sources of data. A detailed description is given in Dijk (2018). The location of an entity is usually a country (or else a region such as Hawaï (U.S), La Réunion (Fr.)) that defines where the legal entity has its headquarters. Therefore, a subsidiary and a parent company can have different locations. See Beddi and Mayrhofer (2010) for in-depth insights into the role of location in headquarters-subsidiaries relationships. The sector of a legal entity is a description of its activity. The database had some inconsistencies and missing values. In some cases entities (and edges) appear multiple times. We merged identical edges into a single edge by taking the maximum weight. About \(30\%\) of weights are missing. According to the documentation of Orbis, these links are indirect, i.e they represent ownership through other entities and, since we wish to measure direct ownership, we simply removed these edges for our analysis.

Degree distribution and connected components

We used the igraph software implementation (Csardi and Nepusz 2006) to manage the graph structure emanating from this database. After pre-processing, the graph of non isolated entities (i.e. entities with at least one edge) is composed of 39, 398, 321 nodes and 80, 874, 728 edges. Following our discussion in the first section, we consider the subgraph of juridical entities (organizations) for our entity influence analysis. The subgraph of non-isolated organizations is composed of 6, 516, 332 nodes and 6, 670, 813 edges, with 1, 429, 853 components; additional numbers are in Table 1. An important metric in graph mining is the density, representing the average connectivity (i.e. abundance of edges) of the graph. The density of a directed graph is a real number in [0, 1], maximized for cliques and minimized for a graph of isolated nodes. For the graph at hand the density is:

$$D = \frac{|E|}{|V|(|V|-1)} = 1.571e-07$$

which means that the graph is very sparse. The degree distribution is in Fig. 1. We can see that the degrees vary importantly in the graph. More specifically there are few entities with up to more that a million capitalistic relations whereas the vast majority have less that one hundred ones.

Table 1 Capitalistic ownership graph instance (Orbis). (1) All edges, (2) Edges with unknown weights removed

The distributions of entities by country is displayed in Table 2. We see that the distribution is not uniform and is not correlated with the sizes of each country. There are relatively few entities in the United States (US) compared to some European or South American countries. This is because the data source has little knowledge of US entities. One possibility for refining the study in relation to this country would be to supplement Orbis with data from other sources (outside the scope of this work).

The subgraph of organizations contains a very large component (i.e. a set of entities where each pair of nodes is connected via a path) of 1,442,704 nodes and 1,816,874 edges. The other components are very small, smaller than one thousand nodes. The component-size distribution is in Fig. 2, which shows that there are a lot of small components (some hundred of nodes), and lots of isolated nodes. Two drawings of smaller components (less than a thousand nodes), are depicted in Fig. 3. We do not include a fine-grained classification of the structure of these small components. However, in the next subsections we provide the analytics on the largest component of juridical entities of 1, 442, 704 nodes following Sect. 2.

Fig. 1
figure 1

Degree distribution of organizations (Orbis)

Fig. 2
figure 2

x-axis: Size of components. y-axis: Number of components. The figure represents the distribution of the sizes of connected components of juridical organizations (companies). It contains one large component (\(\sim 2\) millions nodes), and thousands of small components (between ten and a thousand entities)

Table 2 Top 20 most frequent countries in total

Degeneracy and centrality measures

In Sect. 2.5, we define the notion of k-degeneracy and k-cores, named \(C_k\). The maximum k-core (defined as the k-core such that \(C_k\) is not empty and k is maximal) of the graph allows to find an approximation of the densest part of the graph in linear time (Batagelj and Zaversnik 2003). On the one hand, in the original graph of entities restricted to organizations, the large component of 1,442,704 million nodes has a degeneracy value equal to 18 composed of 45 entities. \(C_{18}\) is a very dense community of entities, where each of them is being owned (resp. owns) the capital of (resp. by) at least 18 other entities in total. Recall that the other components are much smaller (i.e few hundreds of entities) and sparse, so that degeneracy in theses components is not very informative for data mining.

On the other hand, the distribution of countries and sectors in \(\{C_{k} \; |\; k \ge 9\}\) (992 entities) are displayed in Tables 3 and 4. These tables show that countries and sectors are not uniform but shared between a limited amount of countries (Singapore, Australia, Germany, Ukraine, and France), and sectors (Activities of holding companies, Buying and selling of own real estate, Other monetary intermediation, Rental and operating of own or leased real estate, Activities of head offices). These countries and sectors have a more intense interaction for capitalistic property.

Fig. 3
figure 3

2 examples of small components

Table 3 20 most frequent countries in the 10 top-k cores (992 entities)
Table 4 Most frequent sectors in top k-cores (992 entities)

Aggregation by attribute

In order to provide an example of location analysis of such network, we consider the case of France. We computed the aggregated sites whose capital is most possessed by French entities (top 20 countries ranked by edge weight). Conversely, we computed the top 20 aggregate locations that own capital of French entities. We reported results in Fig. 4. The entities most held by the French entities are those located in their former colonies (ie Algeria, Togo, Congo, Côte d’Ivoire, Benin, Chad, Gabon, Senegal, Cameroon, Comoros, Madagascar, ...). Conversely, the French entities are mostly owned by those located in economically strong countries, such as Germany, China, the United States, Japan, the United Kingdom, the United Arab Emirates, as well as countries close to France geographically or sharing close historical relations (Belgium, Italy, Morocco, ...). We also reported top-5 neighbors sorted by weight in descending order (Fig. 4d, c). Top in weights are much lower that the top out weights, this means that French entities have a tendency to own capital of entities in other countries rather to have capital owned at the international level.

Figure 5 depicts the densest k-core (\(k=44\)) on the meta-graph of countries, and represents the most connected subgraph of countries of capitalistic ownership. We observe an important proportion of countries in Europe (France, Germany, United Kingdom, Italy, Spain, Portugal, ...). Other economically strong countries such as United States, China, United Arab Emirates and Republic of Korea are also represented. These two categories have a dense collaboration in terms of capital ownership. Two important aspects about Fig. 5:

  • Some “tax havens” are present (Cayman Islands, Malta, Cyprus, ...) despite their relative low number of entities.

  • The United States of America are in the densest k-core of the meta-graph of countries, even though under-represented in the dataset (less than 850000 of entities, cf. Table 3).

Figure 6 depicts the densest k-core (\(k=218\)) on the meta-graph of sectors (limited to 15 sectors with highest degree for visibility). It reveals a dense collaboration between several economic sectors (Financial activities, real estate, engineering and technical consultancy, ...).

Fig. 4
figure 4

France as source and target of capital ownership

Fig. 5
figure 5

Densest k-core on the meta-graph of countries (\(k=44\))

Fig. 6
figure 6

Densest k-core on the meta-graph of sectors (\(k=218\), limited to 15 highest degree sectors for visibility)

Influence analysis

Recall that, in the context of this paper, influence is the capacity to have an effect on the development or decisions of an entity. In this subsection we present two different types of methods to measure it. Then, we compare the results obtained using the same diffusion model. We used two coreness-based methods to measure influence indirectly. First, sorting nodes based on their coreness number in the graph. Second, following Sect. 2.6, we compute the coreness of each node in its \(\mathsf{RIG}\), and sort them by decreasing value. Then, we keep the top-k nodes and consider the distribution of attributes within these. The results with \(k=10000\) are in Table 5 (left side) for location analysis and in Table 6 for sector analysis.

The problem and method used are described in Sect. 2.8. We use the influence maximization (IM) paradigm in order to measure entity influence in the capitalistic graph. Here, we ran the simulations using influence maximization with martingales, with \(\epsilon =0.1\) and a seed set of 10,000 nodes. The output is a seed set of 10,000 nodes which is an approximation of the IM solution whose quality will be discussed in Sect. 4. We display the distributions within this seed set in the right part of Table 5 for location analysis, and Table 7 for sector analysis.

Table 5 Top 20 influential countries using coreness on the RIG (left) and influence maximization on the IM on the total network of entities (right)
Table 6 Top 20 influential sectors using coreness using RIG
Table 7 Top 20 influential sectors using IMM
Table 8 Comparison with the IC model for several seed set sizes (50 simulations)

In order to estimate the quality of their respective solutions, we compared the number of infected nodes using our diffusion model (independant cascades), with entities obtained in the top k-cores of the graph. We define the ratio \(\tau\) as the ratio of the seed set size and the number of vertices of the graph. For practical software reasons (NDlib library Rossetti et al. 2018), we considered larger seed sets (\(\tau \in \{5\%, 10\%, 15\%, 20\%\}\) of the considered graph where \(\tau = 5\%\) corresponds to 75,136 nodes). The results are shown in Table 8.

Discussion and conclusion

Our analysis based on centrality measures and influence maximization is consistent with economic and historical facts such as the existence of tax heavens, and private capital ownership over its former colonies (we specifically studied the case of France). This therefore allowed us to reveal important locations and sectors communities in the economy, and potentially target most influential entities. Our method allows us to validate these intuitions and to provide indicators of the influence of countries via private capitalistic possession.

Initial seeds using k-cores and yield a better influence score for several seed set sizes ranging from 5 to 20% using the independent cascade diffusion model. Results suggest that several means should be considered (depending on the size of the seed set) to measure influence in large networks with similar degree distribution. Our main result via k-core analysis on the capital–ownership data of organizations reveals a central component. And within, our analysis leveraging the different approaches listed in the previous section helped define communities of countries or sectors that are influencing the economy of capital ownership.

As with any study of this kind, there are possible limitations. For example, we may highlight the limitation of the approximation of different sized entities as having the same capital, since we aggregated the weights without taking in account the capital value. That is to say, the size of capital of entities is not explicitly modeled (entities are assumed to have the same capital). However, although a direct deterministic link cannot be established, there are strong links (Brailsford et al. 2002), in particular a positive relationship between ownership and capital structure. Therefore from a probabilistic point of view, one can argue that our analysis is based on structure with the volume of capital ‘marginalized out’. And this appears to be the case in so far as that our results and conclusions corroborate those of the literature. This, of course, is not to say that a follow up analysis (including capital holdings directly) would yield a finer-grained analysis.

The influence measurement following methods of RIG and IMM (Sects. 2.6 and 2.8, respectively) revealed significant difference in the distributions of attributes. For all seed set sizes, we have obtained better results using RIG coreness than with IMM. Influence based on coreness on the initial graph yields better solutions for \(\tau \in \{5\%\), \(10\%\), \(15\%\}\) seed set sizes, and IMM performs a bit better than top-k coreness for \(\tau =20\%\). This is an interesting discovery. In spite of good theoretical guarantees, the IMM algorithm of influence maximization suffers on some graph instances and seed set sizes found in practice (such as this one). This is also in accordance with experiments of Giatsidis et al. (2019) suggesting that coreness can yield excellent influence measurement in our context of interest of this work. We thereby suggest that for large and sparse networks such as the one we considered, coreness-based influence measurement should be preferred.

Our results are inline with the literature. For example, Garcia-Bernardo et al. (2017) also use a data-driven approach to identify sinks, which included the Netherlands, the United Kingdom, Ireland, Singapore, and Switzerland—all identified in the top-44 core countries in our analysis, and nearly all (except Switzerland) in the top-20 core. In this context, sinks attract and retain foreign capital. An example was given in Garcia-Bernardo et al. (2017) where in three years (2007–2009) Google moved for tax reasons most of its profits generated outside the United States (US$12.5 billion) through corporate entities in the Netherlands. On the other hand, other countries are attractive as intermediate destinations of international investments and transfers. For example our analysis uncovered the Cayman Islands. This is covered in discussion it the literature and, e.g., Roberts (1995); Fichtner (2016), specifically highlight and cover this case. Authors in Vitali et al. (2011) found that transnational corporations form central structure where a large portion of financial control flows to a small tightly-knit core of financial institutions; this also corroborates our results. The centralization we see has been explained as a partial results of tightening monetary policy (Brancaccio et al. 2019). It appears—and this is supported by, e.g., Vitali et al. (2011) who refer to an economic ‘super-entity’, that not only an effect of policy but which could have major future policy implications. Therefore it is important that ongoing research confirms these structures.

It may appear as surprising that U.S. entities did not have a relatively more important role, though we emphasise that they simply do not appear in this data. Although we have limited information on the U.S., other research, e.g., Heemskerk and Takes (2016)—who used community detection through modularity maximisation—discusses strong transatlantic connections between Europe and North America as part of a dense network of shared directors and control, as opposed to a relatively more separate Asian cluster still relatively subservent to the US-Europe center. In other words, traditional core countries still form dominant function. On a more detailed level, Glattfelder and Battiston (2009) show that although Anglo-Saxon countries have highly concentrated control in the hands of a few shareholders, whereas in European countries this concentration is relatively less significant. When we looked at France in particular, we see diversity among neighboring countries with outward flow, and inward flow (i.e., greater control) over former colonies. This provides further insight (and validation) of the functionality of Orbis instance Dijk (2018).

We notice that IMM improves over RIG. The influence measurement following methods of RIG and IMM (Sects. 2.6 and 2.8 respectively) releaveled significant difference in the distributions of attributes. For all seed set sizes, we have obtained better results using RIG coreness than with IMM. Influence based on coreness on the initial graph yields better solutions for \(\tau \in \{5\%, 10\%, 15\%\}\) seed set sizes, and IMM performs a bit better than top-k coreness for \(\tau =20\%\). This is an interesting discovery. In spite of good theoretical guarantees, the IMM algorithm of influence maximization suffers on some graph instances and seed set sizes found in practice (such as this one). This is also in accordance with experiments of Giatsidis et al. (2019) suggesting that coreness can yield excellent influence measurement in our context of interest of this work. We suggest that for large and relatively sparse network, influence measurement with coreness should be preferred.

We outlined results that should be treated with caution, due to the sparsity of data, particularly respective of some countries, for example—in this dataset—the United States. Nevertheless, the overall results provide important insight and information on the influence of the entities and are for the most part consistent with the current global economic situation. Of course, further enrichment of the capital ownership dataset with additional entities and capital information will likely provide an even more precise quantitative measure of influence. We could also consider studying the evolution of this graph over time (monthly or yearly), but this is out of scope of the current study and we leave that line of research for future work.