Introduction

In the “Big Science” era, research collaboration plays an important role in knowledge creation, iteration, and dissemination (Gazni et al., 2012; Katz & Martin, 1997). Thanks to the increasing scale and complexity of scientific projects, we have witnessed a rapid growth in the frequency and influence of collaborative research (Chen et al., 2019; Wang et al., 2013). This trend could be evidenced by the increase in research collaborations at a distance (Bercovitz & Feldman, 2011; Hoekman et al., 2010).

Research collaborations involve not only explicit but also tacit knowledge, which cannot be shared via communication technologies (Holste & Fields, 2010). The coordination activities in collaboration projects also require frequent mobility of researchers, such as sharing lab facilities and exchanging data/samples (Hoekman et al., 2010). Geographical distance therefore remains significant in research collaboration, especially in larger-scale projects, which often lead to more impactful results (Larivière et al., 2015). It is essential to gain an in-depth understanding of the trend towards long-distance and even international collaborations for the following reasons. First, such collaboration enhances the quality of research by combining expertise and resources and sharing the cost (Hoekman et al., 2010). Compared with single-authored publications, collaborative work often leads to more novel results and higher internal quality control (Fernandez et al., 2016). Collaborative work does go with increasing self-citation, but the increase is much weaker than that of foreign citations (Glänzel & Thijs, 2004). Furthermore, these benefits increase as the distances between collaborators increase, as potential collaboration partners are more likely to be found within a larger radius (Hoekman et al., 2010). Indeed, international collaborative research, particularly with elite partners (Tang, 2013) from high-performing countries (Glänzel & Schubert, 2001), tends to be of higher quality and greater impact than local collaborative research (Guan et al., 2016; Jiang et al., 2018). Second, significant public policies and expenditures have been established and allocated to facilitate long-distance research collaborations (Hoekman et al., 2010). For example, China’s 12th Five Year (2011–2015) Plan for Science and Technology Development stated that China would actively take part in international science and technology organizations and international research collaborations (Wang & Wang, 2017).
Similarly, the European Union attempts to construct a European Research Area by coordinating regional, national, and EU research activities (Hoekman et al., 2010) to improve European states’ internal consistency and to break down barriers to innovation. Third, many countries and regions are making great efforts to attract overseas talent, which further facilitates international collaborations (Baruffaldi & Landoni, 2012). For example, the 1000-talent program of China had attracted more than 7000 full-time and part-time senior scholars by 2018. Many are leaders in their fields, and 22.5% are non-Chinese researchers (Jia, 2018). Within the first five years of the 1000-talent program, China attracted more full professors than in the past 30 years combined. These overseas talents bridged the gap between the research communities in two countries and significantly improved both the intensity and quality of international collaborations.

Existing literature on spatial research collaboration has two main limitations. First, most studies assume that research collaborations are homogeneous and ignore the existence of leadership. It is sensible to assume that the first and corresponding author(s) play a leading role in the collaboration. Thus, their relationships with the other authors should be stronger than the relationships among the other authors (Wang et al., 2014). Several studies examine research leadership (González-Alcaide et al., 2017; Wang et al., 2014). However, they either do not consider the social network structure of the co-authorship network (Chaocheng et al., 2019) or ignore the intensity of collaborations when measuring leadership (Zhou et al., 2018). Second, a collaboration relationship is assumed to be evenly distributed among the authors of the same article, without considering the spatial features between the institutions. Existing algorithms for identifying key research entities in collaborations mainly focus on the topological features of the entities in the collaboration network (Zhou et al., 2018), without incorporating the spatial features into the network. Collaboration at a longer distance has a higher impact than that at a short distance (Hoekman et al., 2010). Identifying important research entities in research collaboration from a spatial perspective therefore becomes essential to knowledge creation and dissemination (Forman & van Zeebroeck, 2019; Wu, 2013).

To fill the above research gaps, we extend the literature as follows. First, we model the collaboration relationships as a directed network at the institution level, where the direction of an edge indicates the leadership flow between the two institutions. Second, we incorporate the spatial features between institutions as edge weights in the collaboration network. Third, based on the constructed geographically weighted collaboration network, we propose a novel metric, namely the spatial research leadership rank (SpatialLeaderRank), to identify the leading institutions while considering both the collaboration intensity and the spatial features. More specifically, an institution is considered to have higher leadership status according to the following three criteria: (a) the institution frequently plays the corresponding role in papers with other institutions; (b) the institution frequently plays the corresponding role in longer-distance and even cross-linguistic-border collaborations; (c) the participating institutions led by the institution have high leadership status themselves. We exemplify and validate the proposed SpatialLeaderRank metric using journal publications in the pharmaceutical sciences, a field that has witnessed a dramatic increase in collaborations between multinational scientists in both academia and the biotechnology sector because of the need for very diverse expertise from various disciplines and the rise of R&D outsourcing (Herrling, 1998; McKelvey et al., 2003; Plotnikova & Rake, 2014). We also examine the SpatialLeaderRank in Information Science and Library Science as a robustness check.

The remainder of this paper is organized as follows. Firstly, we review related literature. Secondly, we describe the data and methodologies. Thirdly, we present the results. Fourthly, we conclude the paper. Finally, we discuss the limitations and future directions.

Related literature

Spatial research collaboration

The spatial co-presence remains significant in carrying out the research collaboration tasks, though the hindering effect of spatial distance to research collaboration has declined due to the advancement in communication technologies (Hoekman et al., 2010). However, research collaboration involves not only explicit but also tacit knowledge. Explicit knowledge can be easily articulated or exchanged with communication technologies. Tacit knowledge, such as the abilities, developed skills, experience, and undocumented processes, is embedded in one’s experience and values (Holste & Fields, 2010). In research collaboration, tacit knowledge can be transferred between researchers by personal contact. However, it cannot be passed via communication techniques in formulas, diagrams, verbal descriptions, or instructions (Collins, 2001). Face-to-face interaction enables researchers to conduct intense and complex forms of communication, which greatly facilitate the transfer of tacit knowledge (Collins, 2001). What’s more, the coordination activities in large-scale collaboration projects involve frequent mobility of researchers, such as sharing lab facilities, meetings, seminars. This imposes frequent and considerable travel time and economic costs. As the scale, complexity, and period of collaboration projects increase, the tacit knowledge sharing tends to be more essential, and the coordination activities tend to be more costly. The ongoing COVID-19 pandemic poses new challenges to research collaborations (Callaway, 2020), particularly in the pharmaceutical field (Yu et al., 2020), due to the travel bans, quarantine and disruption of the supply chain system. Therefore, spatial distance remains significant in research collaboration, especially those with large scale, high complexity, and long period.

The existing research on spatial research collaboration mainly focuses on the spatial pattern of research collaboration and the role of geographical proximity on research collaboration. The spatial pattern of research collaboration has been systematically studied at multiple levels. A recent study illustrated the spatial patterns of international knowledge flows and evaluated the effect of various factors, including the geographical factor (Gui et al., 2018). The establishment of research alliances with more developed countries constituted a critical mechanism, which integrated developing countries into the global research community (González-Alcaide et al., 2017). Researchers in small European states were found to be less homogenously collaborating with both domestic and foreign partners (Ukrainski et al., 2014). Research activities in Brazil were found to be spatially heterogeneous. A geographical decentralization process of scientific research activities across the country was needed to stimulate the development of those privileged areas (Sidone et al., 2017). The role of geographical proximity has been explored in various fields. Plotnikova and Rake (2014) examined the country-level determinants in pharmaceutical research and found that the geographical distance was negatively associated with the research collaboration. In humanities, arts, and social sciences, the geographical distance was found to be critical to the collaborative activities (Luo et al., 2018). Similar conclusions could be drawn from the field of ecology (Parreira et al., 2017) and immunology (Lander, 2015).

There are three main types of drivers of long-distance and cross-border collaboration (Jeong et al., 2014): input factor, motivation, and informal communication. A typical driver of the input-factor type is academic excellence. Academic excellence may not only lead to superior synergy through cross-border collaboration but also lower the cost of searching for cross-border partners due to the high reputation of researchers with academic excellence (Jones et al., 2008). As for motivation, researchers at low academic levels tend to undertake long-distance and cross-border collaboration for higher academic productivity and quality (Laband & Tollison, 2000). As for informal communication, it is widely acknowledged that most research collaborations begin informally and are triggered by informal communication. In particular, informal face-to-face communication plays an important role in the formation of long-distance and cross-border collaborations (Jeong et al., 2011). Under some conditions, geographical distance may not be as significant as usual to research activities. For instance, once social relationships are established, patent inventors can remain socially close in citing patents even though they are geographically separated (Agrawal et al., 2006). Collaboration with local elite partners may lead to higher research quality than long-distance or international collaboration (Tang, 2013). However, most studies show that research quality is positively correlated with international collaboration. Table 1 summarizes the relevant literature reporting the correlation between collaboration at a distance and research performance. Research collaboration at a distance has become increasingly prevalent (Bercovitz & Feldman, 2011; Hoekman et al., 2010).
Despite the abundant evidence that spatial features influence research performance, existing research on collaboration networks has not accounted for spatial patterns and has mainly assumed that the edges in the collaboration network are homogeneous.

Table 1 Selected empirical studies about collaboration at a distance and research performance

Research leadership

Research leadership is associated with the capability to integrate the research resources, such as knowledge, expertise, and equipment, to launch, advance, and sustain research projects (Wagner et al., 2001), and is associated with high production as well as quality (Klavans & Boyack, 2010). Research leadership has been a topic of perennial interest in clinical research (Campbell et al., 2001), entrepreneurial organizing (Hansson & Mønsted, 2008), social network analysis (Babbar et al., 2019), etc. Research leadership can be classified into three sub-classes (Klavans & Boyack, 2010): publication leadership, reference leadership (Calero-Medina et al., 2008), and thought leadership (Klavans & Boyack, 2008). From the bibliometrics perspective, a typical way to measure research leadership is to examine the authorship position (Chinchilla-Rodríguez et al., 2019; Quan et al., 2019). The authorship positions in the byline reflect each author’s relative contribution (Liu & Fang, 2014). Despite the disciplinary bias, it is commonly acknowledged that the first author and the corresponding author often lead the research collaboration and make a major contribution (Wang & Wang, 2017). In the biomedical field, the first author is often an early-career researcher assigned to carry out the research and write the research paper. Simultaneously, other non-leading co-authors act in more specialized roles (Sekara et al., 2018). To limit irresponsible authorship listing, the International Committee of Medical Journal Editors (2018) establishes a series of authorship criteria in medical journals. Notably, it elaborates on the corresponding author’s role that the corresponding author is responsible for communication with the journal during the submission, peer review, and publication process. The corresponding author should be available during the whole process to timely respond to any editorial queries. 
Therefore, corresponding authors should be responsible for both scientific and non-scientific contributions. Moreover, the corresponding author’s responsibility is more prominent with the increase of collaboration scales, the growing complexity and depth of the research (Hemlin et al., 2013).

Recently, the term “research guarantor” has been widely adopted as an indicator of research leadership (Moya-Anegón et al., 2013). The concept of “research guarantor” was initially introduced by Rennie et al. (1997). The literature does not claim that an individual corresponding author is the research guarantor, but rather that the research group/institution to which the corresponding author belongs is the research guarantor. In a collaboration, the research guarantor contributes substantially and ensures the integrity of the entire project. Thus, the research guarantor has a special role in collaboration, including negotiating the authorship within the team, obtaining the corresponding authorship, and receiving more credit (Moya-Anegón et al., 2013; Sánchez-Jiménez et al., 2017). Building on the above studies, since we are studying research leadership at the institution level, we employ the corresponding author’s institution as an indicator of research leadership.

PageRank and LeaderRank

PageRank was initially proposed by Google to rank the importance of webpages (Brin & Page, 1998). In the following decades, its variations and applications have boomed across a broad set of fields. PageRank has also been widely applied to analyzing research collaboration networks. Liu et al. (2005) transformed each undirected edge in the research collaboration network into a set of bi-directional, symmetrical edges and defined a modification of PageRank, namely AuthorRank. Fiala et al. (2008) modified PageRank by incorporating properties of both the citation and the co-authorship graphs.

Successful as it is, PageRank has several drawbacks (Gleich, 2015). The stability of the ranking and its robustness to noise and manipulation vary with the choice of parameters (Lu et al., 2016). Moreover, if there are disconnected components in the network, the ranking result is not unique. To this end, Lu and Zhou (2011) proposed LeaderRank, an adaptive and nonparametric algorithm, which adds a ground node that is bi-directionally connected to every other node and then performs random walks. As a result, LeaderRank has a faster convergence rate, higher stability for noisy data, and more robustness to manipulations (Li et al., 2014). LeaderRank has further been applied to identify the influential nodes in complex products and systems (Li et al., 2019), power grids (Zhou et al., 2019), and manufacturing services (Wu et al., 2019). Notably, in the field of “Library & Information Science”, Xu et al. (2020) found that LeaderRank had the best performance in ranking science and technology citation data, compared with 17 other network-based metrics. However, it is noteworthy that all the previous variants of PageRank, including LeaderRank, only consider the topological features of nodes while ignoring non-topological features, especially the spatial features that are very important to academic performance and impact.

Data and methodology

Data collection

We collect data from the Web of Science (WOS) Core Collection database. Specifically, following Plotnikova and Rake (2014), we collect publications in categories related to pharmaceutical study during 2010–2018. To check the robustness and gain a comprehensive understanding, we also collect the publications in the category of “Information Science & Library Science (ISLS)” and compare the results in these two very different fields. More precisely, the data were retrieved using the search term “WC = A AND PY = B”, where A denotes the sub-categories in pharmaceutical sciences, and B is 2010–2018. We restrict the publications to journal articles in English and exclude other publication types such as meeting abstracts, letters, editorial materials, or reviews. We further restrict the data to publications with at least two co-institutions, which yields a sample of 323,146 publications in pharmaceutical sciences and 28,158 publications in ISLS. It is noteworthy that in the WOS dataset, the problem of missing information in the address field remains significant even for recent publications (Liu et al., 2018). This problem may affect the validity of the address-based metric, SpatialLeaderRank. Following Liu et al. (2018), we collect the publications with missing address information using the search term (Query #1) NOT (Query #2), where (Query #1) is the same as the former search term, “WC = A AND PY = B”, and (Query #2) is AD = (A* OR B* OR C* OR D* OR E* OR F* OR G* OR H* OR I* OR J* OR K* OR L* OR M* OR N* OR O* OR P* OR Q* OR R* OR S* OR T* OR U* OR V* OR W* OR X* OR Y* OR Z* OR 0* OR 1* OR 2* OR 3* OR 4* OR 5* OR 6* OR 7* OR 8* OR 9*). We obtain 6074 and 2613 address-missing publications in English in pharmaceutical sciences and ISLS, respectively. Most of the address-missing publications are either single-authored or lack an abstract and full text, and thus are excluded from the data.
We further concentrate on 540 and 77 multi-authored publications with abstracts in pharmaceutical sciences and ISLS, respectively. We manually obtained the full address information of the authors of these papers by searching for them on Google. Finally, we extract 2459 institutions in pharmaceutical sciences and 841 institutions in ISLS, which have been the primary affiliation of the corresponding author for at least one paper (with multiple institutions) in each year. To disambiguate institutions’ names, since we are studying at the institution level and WOS provides a complete list of authors and their institutions, we first remove the duplicates, then sort the names alphabetically, and finally disambiguate them manually. We use Google Maps to obtain the latitude, longitude, and country of each institution. For an institution with multiple addresses, we only adopt the first Google Maps search result. The linguistic data is from Melitz and Toubal (2014). Note that the Chinese Academy of Sciences (CAS) has over 100 branches distributed nationwide, covering six academic divisions: Earth Sciences, Chemistry, Information Technological Sciences, Life Sciences & Medical Sciences, Mathematics & Physics, and Technological Sciences. In this paper, we only focus on pharmaceutical sciences and ISLS, and the majority of the pharmaceutical (63%) and ISLS (82%) publications are published by the branches in Beijing, where CAS’s headquarters is located. Following Fu et al. (2011) and Li et al. (2009), we pool the different branches under one heading located in Beijing. We do the same for other mega organizations, such as the Russian Academy of Sciences and the French National Centre for Scientific Research (CNRS).

Geographically weighted and directed network

Measurement of research leadership flow

There are three main approaches to distributing the credit of a co-authored paper among its contributors (Gauffriau et al., 2007). From an institution-level perspective, the first approach is whole counting, which attributes one full credit to each unique collaborating institution. The second approach is fractional counting, where one credit is shared by all institutions. The third approach is straight counting, where only the most prominent institution receives one full credit, while the others receive none. Since we are studying research leadership, we adopt straight counting and attribute one full credit to the leading institution (the corresponding author’s institution). For papers with multiple corresponding authors and multiple leading institutions, the one full credit is shared equally among the leading institutions of the paper. In this paper, the notion and measurement of research leadership flow are based on Chaocheng et al. (2019). Specifically, for a paper \(p\) with leading institution number \({\text{LN}}_{p}\) and total institution number \({\text{TN}}_{p}\), each leading institution receives \(\frac{1}{{{\text{LN}}_{p} }}\) credit. Therefore, the research leadership flow intensity between one leading institution \(a\) and one participating institution \(b\) is defined as,

$$\begin{array}{*{20}c} {{\text{RLF}}_{p, ab} = \frac{1}{{{\text{LN}}_{p} }} \times \frac{1}{{{\text{TN}}_{p} }} = \frac{1}{{{\text{LN}}_{p}\, \times \,{\text{TN}}_{p} }},} \\ \end{array}$$
(1)

And the aggregated research leadership intensity between institution \(a\) and \(b\) is defined as,

$$\begin{array}{*{20}c} {{\text{RLF}}_{ab} = \mathop \sum \limits_{p = 1}^{{P_{ab} }} {\text{RLF}}_{p, ab} ,} \\ \end{array}$$
(2)

where \(P_{ab}\) is the number of co-authored papers by \(a\) and \(b\), where \(a\) is the leading institution. And the research leadership mass of \(a\), \({\text{RLM}}_{a}\) (the total research leadership intensity between leading institution \(a\) and all other collaboration institutions) is defined as

$$\begin{array}{*{20}c} {{\text{RLM}}_{a} = \mathop \sum \limits_{b = 1}^{{B_{a} }} {\text{RLF}}_{ab} ,} \\ \end{array}$$
(3)

where \(B_{a}\) is the number of participating institutions that \(a\) has led.
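
As an illustration, Eqs. (1)–(3) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation used in this study: the function name `leadership_flows` and the input layout (a list of dictionaries with the keys `leading` and `institutions`) are hypothetical, and every other institution on a paper, including any other leading institution, is assumed to count as a participating institution of leader \(a\).

```python
from collections import defaultdict


def leadership_flows(papers):
    """Aggregate RLF_ab (Eqs. 1-2) and RLM_a (Eq. 3) over a list of papers.

    Each paper is a dict with two keys (a hypothetical layout):
      'leading'      - institutions of the corresponding author(s)
      'institutions' - all institutions on the paper, leaders included
    """
    rlf = defaultdict(float)  # (leader a, participant b) -> RLF_ab
    for paper in papers:
        leaders = set(paper["leading"])
        members = set(paper["institutions"])
        ln, tn = len(leaders), len(members)
        per_pair = 1.0 / (ln * tn)  # Eq. (1): 1 / (LN_p * TN_p)
        for a in leaders:
            for b in members:
                if b != a:  # assumption: no flow from an institution to itself
                    rlf[(a, b)] += per_pair  # Eq. (2): sum over papers

    rlm = defaultdict(float)  # leader a -> RLM_a
    for (a, _b), value in rlf.items():
        rlm[a] += value  # Eq. (3): sum over participants led by a
    return dict(rlf), dict(rlm)


# Toy input: paper 1 is led by A alone, paper 2 is co-led by A and B.
toy_papers = [
    {"leading": ["A"], "institutions": ["A", "B", "C"]},
    {"leading": ["A", "B"], "institutions": ["A", "B"]},
]
toy_rlf, toy_rlm = leadership_flows(toy_papers)
```

In the toy input, the pair (A, B) receives 1/3 from the first paper and 1/4 from the second, so the aggregated flow from Eq. (2) is 7/12.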

Measurement of spatial score

Boschma (2005) identifies five notions of proximity (geographical, cognitive, institutional, organizational, and social) that are relevant to research collaboration and interactive learning. Following Boschma (2005), a broad set of literature employs the gravity model and finds that geographical proximity and the other proximities are determinants of the formation of research collaboration (Fernandez et al., 2016; Zhang & Guo, 2017). In the gravity model, the estimation equation is as follows

$$\begin{aligned} I_{ab} & = \beta_{0} + \beta_{1} \lg \,{\text{Pubmass}}_{a} + \beta_{2} \lg \,{\text{Pubmass}}_{b} + \beta_{3} \lg \,{\text{Geographical}}_{ab} \\ & \quad + \,\beta_{4} {\text{Institutional}}_{ab} + \mathop \sum \limits_{k = 5}^{K} \beta_{k} \lg s_{k} + \epsilon_{ab} , \\ \end{aligned}$$
(4)

where \(I_{ab}\) denotes the collaboration intensity, measured by the number of co-publications, and \({\text{Pubmass}}_{a}\) and \({\text{Pubmass}}_{b}\) denote the number of previous publications of institutions \(a\) and \(b\), respectively. \({\text{Geographical}}_{ab}\) is the geographical proximity, which is measured by the spatial distance between the two institutions. \({\text{Institutional}}_{ab}\) is the institutional proximity, where institution refers to common habits and cultural norms. Institutional proximity can be measured by a dummy variable, which takes 1 if the two institutions are from the same country (Fernandez et al., 2016) or linguistic area (Gui et al., 2018; Zhang & Guo, 2017), and 0 otherwise. \(s_{k}\) denotes other dimensions of proximity, including cognitive, social and economic proximity.

From a geographical point of view, geographical and institutional proximities are spatial features. On the one hand, geographical distance remains significant in research collaboration. On the other hand, linguistic borders render the cross-linguistic-border collaboration less likely due to national funding schemes, institutional framework, norms, or value (Hoekman et al., 2010). Collaborators are expected to use their native languages if a native language is shared among them. Particularly, knowledge sharing and coordinating activities will become easier when collaborators speak the same language (Liang et al., 2006). The cross-linguistic-border nature can capture the institutional proximity more comprehensively. Therefore we adopt cross-linguistic-border nature as the institutional proximity. Long-distance and even cross-linguistic-border collaborations bring unique benefits, such as better access to international data (Jonsen et al., 2013), a higher tendency to stimulate new ideas (Ellis & Zhan, 2011), and increased international visibility and impact (Kwiek, 2015).

Therefore, based on the gravity model, we propose to measure the spatial proximity using the following spatial score, which takes an additive form of both the geographical distance and the cross-linguistic-border nature. For a publication \(p\), in line with the form of the gravity model, the spatial score of the pair formed by leading institution \(a\) and collaborator institution \(b\) is defined as:

$$\begin{array}{*{20}c} {{\text{SPS}}_{p,ab} = \lg \left( {{\text{Geographical}}_{ab} } \right) + \lambda \times {\text{Institutional}}_{ab} ,} \\ \end{array}$$
(5)

where \({\text{Geographical}}_{ab}\) denotes the geographical distance between institutions \(a\) and \(b\), calculated as the great-circle distance from the institutions’ latitudes and longitudes (Sinnott, 1984). \({\text{Institutional}}_{ab}\) is a dummy variable indicating whether \(a\) and \(b\) are from different linguistic areas. \(\lambda\) represents the importance of the cross-linguistic-border nature relative to the geographical distance of the collaboration. The effect of long-distance and cross-border nature on collaborative work varies across research fields (Adams et al., 2005); \(\lambda\) is thus a field-specific parameter. The value of \(\lambda\) is obtained as the ratio between the coefficient of \({\text{Institutional}}_{ab}\) and the coefficient of \({\text{Geographical}}_{ab}\) in the gravity model, \(\lambda = \left| {\frac{{\beta_{4} }}{{\beta_{3} }}} \right|\) (Gui et al., 2018; Zhang & Guo, 2017). Following Chaocheng et al. (2019), we perform a regression based on the gravity model to obtain the coefficients of the different dimensions of proximity. The results showed that, during 2010–2018, \(\lambda\) was 1.29 and 2.86 in pharmaceutical sciences and ISLS, respectively, indicating that the hindering effect of cross-linguistic-border nature in pharmaceutical sciences was lower than that in ISLS. As a field with ever-growing and widely distributed knowledge bases, pharmaceutical sciences have a particularly pronounced cross-border dimension of collaboration (Cantner & Rake, 2014). Regionally clustered pharmaceutical organizations locate R&D facilities to extend their collaboration beyond borders, to connect research partners, and to source knowledge on a global scale (Cooke, 2006). The rapid increase of codified knowledge and its dispersion among a variety of research actors induced a pronounced trend of cross-border collaboration.
However, for social sciences and humanities, research collaboration relies heavily on language and cultural proximity (Pajić, 2015). This conclusion could also apply to ISLS (Jokić, 2020), a typical social science field. Indeed, in ISLS, mutual collaboration is intensive between authors with a similar historical and linguistic background (Jokić, 2020). We can also obtain the evolution of \(\lambda\) by running the gravity model for each year. Figure 1 shows the yearly evolution of \(\lambda\) during 2010–2018. The value of \(\lambda\) for pharmaceutical sciences was stable, with a slightly decreasing trend, indicating that the hindering effect of cross-linguistic-border nature has declined slightly over time. This is in line with the important cross-border dimension of collaboration in pharmaceutical sciences. Interestingly, the value of \(\lambda\) for ISLS was relatively high and oscillated periodically. Cross-border collaboration in ISLS relies more heavily on linguistic, cultural, and historical background than that in pharmaceutical sciences.

Fig. 1 The values of \(\lambda\) for pharmaceutical sciences and ISLS during 2010–2018
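
To make Eq. (5) concrete, the following Python sketch computes a great-circle distance and the resulting pairwise spatial score. It is a minimal illustration under stated assumptions, not the code used in this study: the haversine formula is assumed as the great-circle computation, distances are assumed to be in kilometres with a mean Earth radius of 6371 km, \(\lg\) is read as the base-10 logarithm, and the function names and the regression coefficients in the usage lines are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed value


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km via the haversine formula (degrees in)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def relative_weight(beta_geo, beta_inst):
    """lambda = |beta_4 / beta_3| from the gravity-model coefficients."""
    return abs(beta_inst / beta_geo)


def spatial_score(distance_km, crosses_linguistic_border, lam):
    """Eq. (5): lg(distance) + lambda * cross-linguistic-border dummy."""
    if distance_km <= 0:
        # lg is undefined for co-located institutions; handling is left open
        raise ValueError("distance must be positive")
    dummy = 1.0 if crosses_linguistic_border else 0.0
    return math.log10(distance_km) + lam * dummy


# Hypothetical coefficients, chosen only so that lambda matches the
# value of 1.29 reported above for pharmaceutical sciences.
lam = relative_weight(beta_geo=-0.5, beta_inst=-0.645)
score = spatial_score(1000.0, True, lam)  # lg(1000) + 1.29, about 4.29
```

Under these assumptions, a 1000 km collaboration that crosses a linguistic border scores about 4.29 in pharmaceutical sciences, whereas the same distance within one linguistic area scores 3.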

Therefore, the longer the geographical distance is, the smaller the probability of collaboration is, and the higher the spatial score is. Meanwhile, cross-linguistic-border nature positively influences the spatial score. The spatial score of a publication \(p\) is the average spatial score of the leading institution and participating institution pairs, which can be expressed as

$${\text{SPS}}_{p} = \frac{1}{{{\text{LN}}_{p} }} \times \frac{1}{{{\text{TN}}_{p} }} \times \mathop \sum \limits_{a}^{{{\text{LN}}_{p} }} \mathop \sum \limits_{b}^{{{\text{TN}}_{p} }} {\text{SPS}}_{p,ab} ,$$
(6)

where \({\text{TN}}_{p}\) is the number of total institutions and \({\text{LN}}_{p}\) is the number of leading institutions. Similar to Eq. (1), leading institutions share the spatial score of paper \(p\). Each leading institution receives a spatial score \(\frac{{{\text{SPS}}_{p} }}{{{\text{LN}}_{p} }}\). Therefore, the spatial research leadership flow intensity between a leading institution \(a\) and one participating institution \(b\) is as follows,

$${\text{SL}}_{p,ba} = \frac{{{\text{SPS}}_{p} }}{{{\text{LN}}_{p} \times {\text{TN}}_{p} }} ,$$
(7)

The total spatial research leadership flow intensity between leading institution \(a\) and participating institution \(b\) is

$${\text{SL}}_{ba} = \mathop \sum \limits_{p = 1}^{{P_{ba} }} {\text{SL}}_{p,ba} ,$$
(8)

where \(P_{ba}\) is the number of papers where \(a\) is the leading institution, and \(b\) is a participating institution. We define the direction of spatial research leadership flow from participating institution \(b\) to leading institution \(a\), indicating that \(b\) obtains (cites) the research leadership from \(a\).
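
The aggregation in Eqs. (6)–(8) can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in this study. The function names and the input layout are hypothetical, and one assumption is made about Eq. (6): a leading institution paired with itself contributes nothing to the double sum (its distance is zero, so \(\lg\) is undefined), while the normalization by \({\text{LN}}_{p} \times {\text{TN}}_{p}\) is kept as written.

```python
from collections import defaultdict


def paper_spatial_score(pair_scores, ln, tn):
    """Eq. (6): sum of SPS_{p,ab} over (leader, participant) pairs,
    scaled by 1 / (LN_p * TN_p)."""
    return sum(pair_scores.values()) / (ln * tn)


def spatial_leadership_flows(papers):
    """Eqs. (7)-(8): accumulate SL_ba on directed edges b -> a.

    Each paper is a dict (hypothetical layout) with:
      'ln', 'tn'     - numbers of leading and total institutions
      'pair_scores'  - {(leader a, participant b): SPS_{p,ab}}
    """
    sl = defaultdict(float)  # (participant b, leader a) -> SL_ba
    for paper in papers:
        ln, tn = paper["ln"], paper["tn"]
        sps_p = paper_spatial_score(paper["pair_scores"], ln, tn)
        per_edge = sps_p / (ln * tn)  # Eq. (7)
        for (a, b) in paper["pair_scores"]:
            sl[(b, a)] += per_edge  # Eq. (8): edge points from b to a
    return dict(sl)


# Toy paper: A leads B and C, with pair scores 3.0 and 4.29.
toy = [{"ln": 1, "tn": 3, "pair_scores": {("A", "B"): 3.0, ("A", "C"): 4.29}}]
toy_sl = spatial_leadership_flows(toy)
```

For the toy paper, \({\text{SPS}}_{p} = 7.29/3 = 2.43\), and each of the edges from B to A and from C to A receives \(2.43/3 = 0.81\).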

Construction of the network

Now, we can construct a geographically weighted network where institutions are nodes and research leadership flows are directed edges. The weight on each edge represents the intensity of the spatial research leadership flow between the two institutions. Figure 2 illustrates the construction of the network in pharmaceutical sciences. For more details about network construction, please refer to the supplementary information.

Fig. 2

An illustration of a geographically weighted network in pharmaceutical sciences

Spatial research leadership rank

LeaderRank is a simple variant of PageRank that has been shown to outperform PageRank in ranking effectiveness and robustness (Li et al., 2014). In LeaderRank, a ground node, which bi-directionally connects to every other node, is added to the existing network with \(N\) nodes and \(M\) weighted directed edges. Figure 3 illustrates adding the ground node to the weighted and directed spatial research leadership network. Thus, the network consists of \(N + 1\) nodes and \(M + 2N\) edges and is strongly connected. LeaderRank performs a standard random walk process to rank every node. However, LeaderRank only considers the topological features of the collaboration network and ignores non-topological features, especially spatial features, which have been recognized as important factors in academic performance and research impact. Therefore, we propose a new metric, SpatialLeaderRank, to incorporate spatial features into the measure of research leadership. Specifically, the SpatialLeaderRank of node \(a\) at time step \(t\) is denoted as \({\text{SpatialLeaderRank}}_{a} \left( t \right)\). The dynamics of SpatialLeaderRank are described by the following iterative process,

$${\text{SpatialLeaderRank}}_{a} \left( {t + 1} \right) = \mathop \sum \limits_{b = 1, b \ne a}^{N + 1} \left[ {\frac{{{\text{SL}}_{ba} }}{{\mathop \sum \nolimits_{c = 1, c \ne b}^{N + 1} {\text{SL}}_{bc} }} \times {\text{SpatialLeaderRank}}_{b} \left( t \right)} \right],$$
(9)

where \({\text{SL}}_{ba}\) is the spatial research leadership flow intensity from institution \(b\) to \(a\). The spatial research leadership flow intensity from other institutions to the ground node and from the ground node to other institutions is set to 1 (Lu & Zhou, 2011). The process starts with an initialization in which every institution’s SpatialLeaderRank equals 1. Under the iterative process described by Eq. (9), the SpatialLeaderRank values converge to a unique steady state \({\text{SpatialLeaderRank}}_{a} \left( \infty \right)\), \((a = 1,2, \ldots ,N,{ }N + 1)\). We rank all institutions according to \({\text{SpatialLeaderRank}}_{a} \left( \infty \right)\).
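The iteration in Eq. (9) can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the code used in the study: the function name, the `flows` input layout, the tolerance, and the iteration cap are hypothetical, and the ground node is assumed to start with a score of 0, as in the original LeaderRank formulation (the text above only specifies the initial value for institutions).

```python
def spatial_leader_rank(flows, tol=1e-10, max_iter=10_000):
    """Iterate Eq. (9) on a weighted, directed flow network.

    flows: dict mapping (b, a) -> SL_ba, the flow intensity from b to a.
    Returns a dict mapping institution -> steady-state score.
    """
    ground = object()  # ground node linked to every institution
    nodes = sorted({n for edge in flows for n in edge})
    out = {n: {} for n in nodes}
    out[ground] = {}
    for (b, a), weight in flows.items():
        if b != a:
            out[b][a] = out[b].get(a, 0.0) + weight
    for n in nodes:
        out[n][ground] = 1.0  # institution -> ground, intensity 1
        out[ground][n] = 1.0  # ground -> institution, intensity 1

    # Institutions start at 1; ground node at 0 (assumption, see lead-in).
    score = {n: 1.0 for n in nodes}
    score[ground] = 0.0
    for _ in range(max_iter):
        nxt = dict.fromkeys(out, 0.0)
        for b, targets in out.items():
            total = sum(targets.values())
            for a, weight in targets.items():
                # Node b passes its score along outgoing flows,
                # in proportion to SL_ba / sum_c SL_bc.
                nxt[a] += weight / total * score[b]
        delta = max(abs(nxt[n] - score[n]) for n in out)
        score = nxt
        if delta < tol:
            break
    del score[ground]
    return score
```

Sorting the returned dictionary by value in descending order gives the ranking. Because every node passes on its full score at each step, the total score (including the ground node) is conserved across iterations.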

Fig. 3

An illustration of adding the ground node to the geographically weighted network to calculate SpatialLeaderRank

Result and analysis

For simplicity, we mainly introduce the results for pharmaceutical sciences and then summarize the results for ISLS in “Results in Information Science & Library Science” section.

Spatial patterns and evolution of research leadership flows in pharmaceutical sciences

Figure 4 illustrates the distance distribution of research leadership flows during 2010–2018. The data fit a power-law distribution \(P\left( k \right) \propto k^{ - \alpha }\) well. The majority of research leadership flows occur within a small radius, indicating that most research leaders integrate research resources within a short radius.
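The text does not state how the power-law exponent was estimated. One common option, shown here purely as an illustrative sketch, is the continuous maximum-likelihood estimator \(\hat{\alpha } = 1 + n/\sum\nolimits_{i} \ln \left( {x_{i} /x_{\min } } \right)\). The function name and the lower cutoff `x_min` are assumptions introduced for this example.

```python
import math


def powerlaw_alpha_mle(distances, x_min):
    """Continuous MLE of alpha for P(x) ~ x^(-alpha), using x >= x_min.

    distances: iterable of positive flow distances.
    x_min: assumed lower cutoff above which the power law holds.
    """
    tail = [d for d in distances if d >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(d / x_min) for d in tail)
    if log_sum == 0:
        raise ValueError("all observations equal x_min")
    return 1.0 + len(tail) / log_sum
```

A larger estimated \(\alpha\) corresponds to a faster decay of flow frequency with distance, that is, to leadership flows that are more concentrated within a short radius.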

Fig. 4

The distance distribution of research leadership flow

To present the changes in research leadership mass in pharmaceutical sciences, we visualize its geographical distribution in Fig. 5, a kernel density heat map of global research leadership mass in two periods, 2010–2014 and 2015–2018. Kernel density estimation smooths the spatial coordinates to generate a probability density surface from a set of point locations (Downs & Horner, 2012). We can identify three main research leadership mass clusters: the Northeastern United States, the European Union, and Northeastern Asia. More specifically, during 2010–2014, the European Union cluster covered the largest area, and the Northeastern United States had the highest density. In the Northeastern Asia cluster, Japan and the Republic of Korea led the region. In the Greater China Region, the research leadership mass was mainly distributed in the eastern part. In general, during 2010–2014, pharmaceutical sciences research was dominated by the most economically developed countries. In contrast, during 2015–2018, although these developed countries still played a prominent role in leading pharmaceutical research, Eastern Asia emerged as a key leader, with multiple significant clusters in China, Japan, and the Republic of Korea. Remarkably, the Yangtze River Delta Region (the densest region in the figure, including Shanghai and Jiangsu, Anhui, and Zhejiang Provinces), the Jing-Jin-Ji Metropolitan Region (including Beijing, Tianjin, and Hebei Province), and Sichuan Province have emerged as prominent leaders in pharmaceutical research, because of the finest universities and medical schools located in these regions. Other developing countries also improved significantly, especially India and Iran. Overall, the distribution of research leadership mass has become more balanced between western and eastern countries.

Fig. 5

The kernel density heat map of the distribution of research leadership mass at the regional level

Figure 6 presents the detailed evolution of research leadership mass over the years with a bump graph. The width of each bump is proportional to the research leadership mass of the corresponding country/region. We present the top 15 countries/regions in terms of their research leadership mass. At first glance, the bump lines of all countries/regions widen, indicating that the research leadership mass of every country/region grew during 2010–2018. In line with Fig. 5, some significant changes were underway. Mainland China took #1 from the United States in 2018. Japan retained its place at #3. India and Iran rose to #4 and #5, respectively. In a nutshell, given these changes during 2010–2018, it is apparent that the distribution of research leadership mass is rebalancing between western and eastern countries.

Fig. 6

The bump graph of research leadership mass at the country level

Figure 7 details the cross-linguistic-border research leadership flows at the country level. Three chord charts show the cross-linguistic-border research leadership flow pattern in 2010, 2014, and 2018. Again, we present the top 15 countries/regions in terms of cross-linguistic-border research leadership flows. In general, the cross-linguistic-border research leadership pattern is still dominated by western countries. Several developing countries, such as India and Brazil, are improving rapidly. Although Fig. 6 shows that Mainland China took #1 in terms of research leadership mass in 2018, the USA remains first, and far ahead, in terms of cross-linguistic-border research leadership flows. This suggests that the increase in China's research leadership mass is mainly domestic.

Fig. 7

The chord graph of cross-linguistic-border research leadership flows

Effectiveness analysis of SpatialLeaderRank

To evaluate the effectiveness of the proposed SpatialLeaderRank, we perform receiver operating characteristic (ROC) curve analysis to assess the ability of SpatialLeaderRank and other conventional indices to identify the top 5% of institutions with high academic impact. We further use KSim (Haveliwala, 2003) to measure the similarity between each index’s ranking results and the academic impact rank. Last, we compare the detailed rank of institutions in terms of SpatialLeaderRank, other conventional indices, and academic impact indices. The conventional indices include PageRank, betweenness centrality, closeness centrality, indegree centrality, and publication number (Kim & Diesner, 2015; Wu, 2013). The academic impact indices include citation count, citation-based h-index, altmetrics count, and altmetrics-based h-index.

Evaluation

We evaluate the effectiveness of SpatialLeaderRank by using it to predict institutions’ academic impact, which is measured by four common indices. We also compare its performance with that of conventional indices. Citation count is a widely recognized measure of academic impact (Yan & Ding, 2011). However, simple citation count is not robust against manipulations (Hirsch, 2005). To this end, Hirsch (2005) proposed the h-index, the largest value h such that an entity (author, journal, institution, etc.) has h papers each cited at least h times. The h-index combines the quantity and quality of publications and has been widely adopted by the scientific community (Lund, 2019). Recently, Altmetrics indices (Footnote 2) have emerged as popular tools to measure academic impact because they are less subject to publication delays than citation count. Similarly, the simple Altmetrics count is not robust against manipulations either. To this end, Askeridis (2018) proposed an Altmetrics-based h-index, namely the Mendeley-based h-index, which replaces the citation count with the view count of Mendeley readers. Mendeley is a free reference manager and academic social network. The view count of Mendeley users is associated with the later citation count (Aduku et al., 2017) because the view count is crowdsourced from the research community and thus reflects the academic impact earlier than the citation count. In this study, we use two variants of the h-index, the citation-based h-index and the Altmetrics-based h-index. Neither will be large if an entity publishes either many publications with low citation/Altmetrics counts or very few publications with high citation/Altmetrics counts. In total, four evaluation metrics (citation, citation-based h-index, Altmetrics, and Altmetrics-based h-index) are employed as the model outcome to evaluate the effectiveness of SpatialLeaderRank. These evaluation metrics cover all journal publications of an institution.
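The h-index definition above translates directly into code. The following short Python sketch (the function name is ours) works for either variant: pass citation counts for the citation-based h-index, or Mendeley reader counts for the Altmetrics-based one.

```python
def h_index(counts):
    """Largest h such that at least h items have a count of at least h."""
    ordered = sorted(counts, reverse=True)
    h = 0
    for rank, count in enumerate(ordered, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h
```

The two extreme cases described above both yield a small value: fifty papers with one citation each give an h-index of 1, and so does a single paper with one hundred citations.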

We use the ROC curve analysis to examine the capability of SpatialLeaderRank in identifying top institutions with high academic impact (top 5% in terms of citation, citation-based h-index, Altmetrics, and Altmetrics-based h-index) and compare its performance with that of other conventional indices. ROC curve analysis is a graphic method to illustrate the performance of a binary classification system under different recognition thresholds (Hassan et al., 2017; Zhang et al., 2019). The larger AUC (area under the ROC curve) is, the better performance the focal binary classification system has. Figure 8 presents the ROC curve of the SpatialLeaderRank and other indices for identifying the top 5% of institutions with high academic impact in terms of citation, citation-based h-index, Altmetrics, and Altmetrics-based h-index.
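As a hedged illustration of this evaluation setup, the sketch below labels the top 5% of institutions by an impact metric as positives and computes the AUC of an index through its rank-based interpretation: the probability that a randomly chosen positive institution scores higher than a randomly chosen negative one, with ties counted as one half. The function names, the dictionary inputs, and the rounding rule for the size of the top group are assumptions made for this example.

```python
def top_fraction_labels(impact, fraction=0.05):
    """Mark the top `fraction` of institutions by impact as positive."""
    k = max(1, round(fraction * len(impact)))
    top = set(sorted(impact, key=impact.get, reverse=True)[:k])
    return {inst: inst in top for inst in impact}


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: dict institution -> index value (e.g. SpatialLeaderRank).
    labels: dict institution -> True if in the high-impact group.
    """
    pos = [scores[i] for i in labels if labels[i]]
    neg = [scores[i] for i in labels if not labels[i]]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An index that ranks every high-impact institution above every other institution reaches an AUC of 1, while an index unrelated to impact is expected to score about 0.5.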

Fig. 8

ROC curves for SpatialLeaderRank and other indices to identify top institutions with high academic impact

As shown in Fig. 8, all the indices have a reasonably high AUC for all four evaluation metrics, indicating the generally good performance of all the indices in identifying the top institutions with high academic impact. Note that the high AUC value is expected because the majority (95%) of institutions are classified as not influential. The proposed SpatialLeaderRank consistently outperforms the other indices on all measures of academic impact. This indicates that integrating spatial features does help to improve the capability of identifying top institutions with high academic impact. The larger the SpatialLeaderRank score of an institution, the greater its leadership status and the higher the academic impact it generates. It is noteworthy that the AUC gaps between SpatialLeaderRank and the other indices are larger for the h-index (citation-based/Altmetrics-based) than for citation/Altmetrics counts. This is because the h-index (both citation-based and Altmetrics-based) is more robust than citation/Altmetrics counts and reflects the academic impact more closely. Overall, SpatialLeaderRank has an advantage over the other conventional indices in identifying top institutions with high academic impact.

The ROC curve analysis only examines the overlap of the top-k ranked institutions according to different indices, as it treats these top-k ranked institutions as an unordered set. It is therefore useful to adopt another metric to examine the relative ordering of the top-k ranked institutions. In this study, we adopt the KSim metric (Haveliwala, 2003), which is based on Kendall’s \(\tau\) distance measure. Consider a ranking list \(\tau_{1}\) produced by an index and a ranking list \(\tau_{2}\) produced by an academic impact evaluation metric. Let U be the union of the institutions in \(\tau_{1}\) and \(\tau_{2}\). Let \(\sigma_{1}\) be \(U - \tau_{1}\) and \(\tau_{1}^{^{\prime}}\) be the extension of \(\tau_{1}\), where \(\tau_{1}^{^{\prime}}\) contains \(\sigma_{1}\) in addition to the existing ranked institutions in \(\tau_{1}\). The institutions in \(\sigma_{1}\) are all assigned the same ordinal rank at the end of \(\tau_{1}\). Similarly, \(\tau_{2}\) is extended to yield \(\tau_{2}^{^{\prime}}\). The KSim between \(\tau_{1}\) and \(\tau_{2}\) is defined, in the manner of Kendall’s distance, as the fraction of pairs on which \(\tau_{1}^{^{\prime}}\) and \(\tau_{2}^{^{\prime}}\) agree:

$$KSim\left( {\tau_{1} ,\tau_{2} } \right) = \frac{{\left| {\left( {u,v} \right):\tau_{1}^{^{\prime}} ,\tau_{2}^{^{\prime}} \;{\text{agree}}\;{\text{on}}\;{\text{order}}\;{\text{of }}\left( {u,v} \right), u \ne v} \right|}}{{\left( {\left| U \right|} \right)\left( {\left| U \right| - 1} \right)}},$$
(10)

In other words, KSim measures the probability of \(\tau_{1}\) and \(\tau_{2}\)’s agreement on the relative ordering of a randomly selected pair of institutions \(\left( {u,v} \right) \in U \times U.\)
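Eq. (10) can be sketched in Python as follows. The function names are ours, and one detail is an assumption: institutions appended to a list share a tied rank, and a pair is counted as agreeing only when both extended lists give it the same strict order, so a pair that is tied in one list and ordered in the other counts as a disagreement.

```python
from itertools import permutations


def extended_ranks(ranking, universe):
    """Rank positions for `ranking`, with missing items tied at the end."""
    ranks = {item: pos for pos, item in enumerate(ranking)}
    tail = len(ranking)  # shared ordinal rank for all appended items
    for item in universe:
        ranks.setdefault(item, tail)
    return ranks


def sign(x):
    return (x > 0) - (x < 0)


def ksim(tau1, tau2):
    """KSim of two top-k lists: share of ordered pairs with equal order."""
    universe = set(tau1) | set(tau2)
    n = len(universe)
    if n < 2:
        return 1.0
    r1 = extended_ranks(tau1, universe)
    r2 = extended_ranks(tau2, universe)
    agree = sum(
        1
        for u, v in permutations(universe, 2)
        if sign(r1[u] - r1[v]) == sign(r2[u] - r2[v])
    )
    return agree / (n * (n - 1))
```

Identical lists give a KSim of 1 and fully reversed lists give 0. For the top-2 lists `["A", "B"]` and `["A", "C"]`, the two extended rankings agree on the pairs involving A but disagree on the order of B and C, giving a KSim of 2/3.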

Figure 9 visualizes the KSim curve between each index and the four academic impact evaluation metrics from the top 5 institutions to the top 500 institutions. In general, SpatialLeaderRank consistently outperforms all other indices. In particular, when N (the number of top-ranked institutions considered) is small, the effect of extreme cases is more significant. In such cases, the advantage of SpatialLeaderRank over other indices is more obvious, indicating that SpatialLeaderRank is more robust to rare extreme cases. As N increases, the KSim of all indices increases and becomes stable. Betweenness centrality has the worst performance, according to KSim, for all evaluation metrics. The betweenness centrality of an institution is proportional to the number of shortest paths traversing the institution. Its unsatisfactory performance arises because most of the institutions top-ranked by betweenness centrality are local hubs connecting to local institutions. Publication count and indegree have similar performance. Interestingly, we find that if the publications and leading behaviors are mostly associated with local institutions, the academic impact does not improve much. Notably, closeness centrality has a relatively high KSim for Altmetrics and, particularly, the Altmetrics-based h-index. For the Altmetrics-based h-index, when N is larger than 300, the performance of closeness centrality is close to that of SpatialLeaderRank, primarily because institutions with high closeness centrality are at a shorter distance from other institutions, so research output spreads faster through the collaboration network.

Fig. 9

KSim between each index and four evaluation metrics from top 5 to top 500

Ranking institutions by SpatialLeaderRank, other conventional indices, and academic impact indices

We rank the institutions according to SpatialLeaderRank, the other conventional indices (LeaderRank, publication, indegree, betweenness, closeness, and PageRank) and academic impact indices (citation, citation-based h-index, Altmetrics, and Altmetrics-based h-index). Table 2 presents the top 20 institutions. Table 3 presents the rank of the SpatialLeaderRank-based top 20 institutions in other rankings. The top institutions ranked by SpatialLeaderRank are also highly ranked according to all the four academic impact indices. This strong association indicates that leading long-distance and cross-linguistic-border collaborations generally lead to greater academic impact.

Table 2 Top 20 institutions’ rank by SpatialLeaderRank and by other indices
Table 3 Comparison of institution rank by SpatialLeaderRank and other indices

Four institutions (Harvard University, University of Oxford, University of Cambridge, and University of California San Diego) are among the top 20 according to all indices. In particular, Harvard University is ranked first according to the four academic impact indices and in the top three according to the other indices. The Chinese Academy of Sciences (CAS) and the French National Centre for Scientific Research (French: Centre national de la recherche scientifique, CNRS) are state research organizations in China and France, respectively. These two mega organizations have multiple research institutions and have published many papers in the field (#1 for CAS and #7 for CNRS) (Table 2). It is not surprising to observe the high ranking of CAS and CNRS according to the collaboration indices. However, their performance in research impact is less prominent than in collaboration, indicating that although CAS and CNRS widely lead research through collaboration with many other institutions, their academic impact still has room for improvement. Similarly, we found that many other Chinese institutions, such as Shanghai Jiaotong University, Zhejiang University, Fudan University, Peking University, and Sun Yat-sen University, are highly ranked (top 20) in terms of conventional indices, namely publication, indegree, betweenness centrality, closeness centrality, and PageRank. However, neither their SpatialLeaderRank nor their academic impact is highly ranked (top 20). Their publication output does not match their academic impact status, suggesting that Chinese institutions should focus on improving the quality of research rather than quantity alone. Notably, building on the results of Table 2 and Fig. 7, from a SpatialLeaderRank point of view, Chinese institutions are recommended to expand their research leadership radius and integrate international research resources to embed themselves into the global research collaboration network.

Results in Information Science & Library Science

To check the robustness of the study, we perform the same ROC curve analysis and KSim analysis in the ISLS field, as shown in Figs. 10 and 11. The results are consistent with those for pharmaceutical sciences. The proposed SpatialLeaderRank consistently outperformed other indices in predicting the academic impact of institutions. As in the analysis of pharmaceutical sciences, we present the top 20 institutions in ISLS in Table 4. Interestingly, we found a few reasonable changes in the ranks of certain key players in the field. Specifically, Wuhan University is top-ranked by traditional metrics such as publication, indegree, closeness, and LeaderRank, but is ranked only #20 by the proposed SpatialLeaderRank. The University of Illinois at Urbana-Champaign, the long-standing leader in the field, is ranked #9 by LeaderRank, but #1 by the proposed SpatialLeaderRank. These results are plausible because, although Wuhan University has one of the best ISLS programs in China, its leading efforts have so far been mostly among domestic universities. The University of Illinois at Urbana-Champaign, on the other hand, has had the top-ranked ISLS program for decades, and its leading efforts reach more non-US institutions. This finding echoes the previous discussion for the pharmaceutical field: although the research performance of Chinese institutions has been growing very fast in the ISLS field, it is suggested that they expand their leadership to more international collaborations.

Fig. 10

ROC curve for SpatialLeaderRank and other indices to identify top 5% institutions in ISLS

Fig. 11

KSim between each index and four evaluation metrics from the top 5 to top 500 in ISLS

Table 4 Top 20 institutions’ rank by SpatialLeaderRank and by other indices in ISLS

Discussion and conclusions

In this paper, to address the spatial bias in research leadership flows in research collaboration, we examine the spatial distribution and the dynamic trend of research leadership flows in pharmaceutical sciences. We observe that developing countries have been playing an increasingly important role in leading the research in pharmaceutical sciences. However, the cross-linguistic-border research is still dominated by developed countries.

Then, we construct a geographically weighted network, based on which we propose the SpatialLeaderRank. SpatialLeaderRank ranks the institutions integrating both topological features and spatial features. Comprehensive experiments with the data in both pharmaceutical sciences and ISLS demonstrate the superior performance of the proposed SpatialLeaderRank in predicting the academic impact of institutions. Leading institutions are identified and presented.

This study sheds light on the important association between long-distance and cross-linguistic-border collaborations and academic impact. With the growing trend of cross-linguistic-border collaborations, the distance between collaborators, particularly between the research leader and participants, should be an integral consideration when examining research leadership.

From a policy perspective, we found a clear rebalancing process between the research leadership mass in developed and developing countries. Many Eastern Asian countries, particularly China, are quickly emerging as new global leaders in pharmaceutical sciences. There are two main reasons for the change. First, the expenditure on research has been increasing rapidly in China (Basu et al., 2018). As of 2018, the research and development (R&D) expenditure in China was 1967.79 billion CNY (Footnote 3), a 179% increase from 2010. China recently passed the European Union in R&D investment (Basu et al., 2018). Meanwhile, the research funding in the United States only increased by 57% (Footnote 4). With abundant funding, Chinese institutions are apparently playing the role of leader more often. Second, the evaluation of research output in China is primarily based on quantitative measures such as publication number, the impact factor of journals, and citation count. The emphasis on these quantitative measures drove the whole academic community to publish as many papers as possible. Despite the success in publication number, the academic impact of Chinese institutions still lags behind that of Western institutions. It is suggested that policymakers in China shift the focus of research evaluation from quantitative measures towards actual academic impact. On 18/02/2020, the Ministry of Science and Technology and the Ministry of Education of China jointly published an announcement urging Chinese institutions to adopt a more scientific and influence-driven research evaluation approach (Footnote 5). This indicates the start of the transformation from quantity to quality in China.

Western countries, including those in Europe and the United States, still play the role of major leaders in the field. More research expenditure is needed to maintain this status. In general, cross-linguistic-border collaboration is playing an increasingly important role in pharmaceutical research. Given the higher impact of long-distance collaborations, cross-linguistic-border collaboration should be encouraged through joint-funding schemes and academic exchanges.

Long-distance and cross-linguistic-border collaborations should be encouraged to increase the chance of generating high-impact research. However, the impact of research should be judged purely on its scientific merit.

The proposed SpatialLeaderRank is a general method that can be used to evaluate the leadership at the author and country levels. It is also applicable to examine other types of relationships, such as to evaluate the academic influence of scientific journals, to identify the key innovator in the industry by analyzing the patent-citation network, and to identify the key moderator in the financial system by analyzing the guarantee-relationship network.

Limitations and future research

This study has some limitations, which provide potential directions for future research. First, we only study two disciplines: the pharmaceutical field and ISLS. Future research should cover a broader set of fields, or even all disciplines, to enhance the generalizability of our proposed metric. Second, in data processing, for institutions with multiple addresses, we only adopt the first search result from Google Maps. This may lead to a bias for spatial-feature-based studies, particularly for mega organizations such as CAS with many branches distributed nationwide. Dividing the publications of mega organizations among their branches would result in different rankings. Future research should identify and locate each author’s school/department address instead of the institution address so that the spatial research leadership can be depicted more accurately. Since we study the institution level, we ignore individuals’ information. For example, for an author with multiple institutions, we treat the institutions equally. Moreover, we do not take researchers’ mobility into consideration. In particular, there is a time lag between when the research is executed and when the paper appears in print. Leading authors may visit other countries/institutions, complete the research there, and return to their home institution. In this paper, the leading credit is assigned to the home institution, which also leads to a bias. In the future, research leadership should be measured from an individual point of view. Third, we use WOS for data collection, but WOS alone may not have sufficient coverage of publications in pharmaceutical sciences or ISLS (Tang & Shapira, 2011). We may therefore underestimate the absolute counts of publications in pharmaceutical sciences and ISLS. Future research should merge publication data from multiple data sources, such as Scopus (Footnote 6) and PubMed (Footnote 7).