Abstract

This study aims at forming research teams for interinstitutional collaborations. Research institutes have their own purposes and topics of interest. Thus, to support joint research between multiple institutes, we have to consider not only synergies between scholars but also the purposes of the institutes. To solve this problem, we propose a bibliographic network embedding method that can learn characteristics of institutes as well as of individual scholars. First, we compose a bibliographic network that consists of scholars, publications, venues, research projects, and institutes. Collaboration styles and research topics of institutes and scholars are extracted by mining subgraphs from the bibliographic network. Then, vector representations of network nodes are learned based on occurrences of subgraphs on the nodes and neighborhoods of the nodes. Based on the vector representations, we train multilayer perceptrons (MLP) to assess collaboration probability between scholars affiliated with different institutes. For training the MLP, we suggest three strategies: (i) considering every collaboration, (ii) focusing on interinstitutional collaborations, and (iii) focusing on collaboration outcomes. To evaluate the proposed methods, we have analyzed research collaborations of POSTECH (Pohang University of Science and Technology) and RIST (Research Institute of Industrial Science and Technology) from 2011 to 2020. Then, we conducted research team formation for joint research of the two institutes according to two purposes: pure research and commercialization research.

1. Introduction

Research collaboration is one of the primary factors that affect research performance [1–6]. Existing studies on research team formation have concentrated on synergies between scholars [7–10]. To predict the synergies, analyzing or embedding bibliographic networks has recently been the most popular approach [11, 12]. These studies searched for adequate collaboration partners of each scholar by analyzing his/her research history. They supposed that structures of bibliographic networks reflect reputations (e.g., the number of citations), research topics (e.g., preferred venues), and even working styles (e.g., sustainability of collaborations) of scholars [1, 5, 6].

However, this approach does not consider that scholars are not the sole stakeholder of research. As employers and funding sources, research institutes influence research directions and outcomes of scholars. For interinstitutional research projects, the institutes evaluate team members and counterparts of their joint projects according to individual research interests and purposes, as we carefully choose our collaboration partners. For example, POSTECH (Pohang University of Science and Technology) is a research-oriented university. This institute encourages its members to publish research articles with scientific impact. On the other hand, RIST (Research Institute of Industrial Science and Technology) aims to develop practical technology and prefers patents rather than papers. Thus, when members of POSTECH find their collaborators, they may prefer scholars who published many high-impact papers in other research-oriented institutes. However, if scholars in POSTECH want to commercialize their research outcomes, scholars in RIST can be a collaboration partner. Individual expertise and interest of the institutes can be discovered from bibliographic networks. Scholars in POSTECH will focus on papers rather than patents, and RIST might be contrary to POSTECH. Also, as a university, POSTECH covers much broader research areas than RIST. Thus, contributions from POSTECH will be published at more various venues compared to those from RIST.

A comparison of POSTECH with RIST shows differences caused by types of research institutes. However, within the same type, research institutes have individual characteristics according to their research interests. Figure 1 shows topic distributions of papers published by scholars in three major research-oriented universities in Korea. Although the three institutes share common research topics, their priorities for the topics are different. Also, the priorities can be correlated with infrastructures for each research field. In team formation for a project, we should match the project’s research fields with participating institutes’ expertise.

To conduct the research team formation, we should consider both characteristics of each of the scholars and their affiliation. There can be scholars who prefer intrainstitutional collaborations or are not familiar with collaborations. Scholars can also prefer particular types of institutes as collaboration partners (e.g., companies or universities). We can extract collaboration styles of both stakeholders (scholars and institutes) from the bibliographic networks. First, affiliations of collaborators of each scholar reveal what kinds of institutes are preferred by the scholar as collaboration partners. Second, venues of publications written by the scholars show their research interests. Finally, structures of the bibliographic networks represent more detailed research styles of the scholars, such as whether they focus on a few high-quality papers or write prolifically [5, 6]. Also, the structures can reveal working styles of research groups; for example, all group members focus on a research topic, the group leader manages multiple independent projects, or plural middle managers lead individual projects [1].

Therefore, in this study, we propose a method for forming research teams that can consider both collaboration styles of individual scholars and aims of research institutes by embedding bibliographic networks. First, this study suggests the interinstitutional collaboration network (Figure 2). This network includes information for research institutes and projects funded by the institutes, which are barely dealt with by the existing studies. Then, we apply the substructure-based graph embedding methods [1, 13–16] for representing scholars, publications, venues, research institutes, and projects with a fixed-size vector. Collaboration probabilities between scholars are estimated based on the vector representations. To consider individual characteristics of research institutes, we have the following assumptions:
(i) RQ 1. Interinstitutional collaborations have distinct characteristics from other types of collaborations.
(ii) RQ 2. Characteristics of research institutes affect collaborations between the institutes.
(iii) RQ 3. Research institutes have individual interests in topics, types of publications, and so on, and the interests affect employees of the institutes.

Based on the assumptions, we propose three approaches for the interinstitutional team formation: (i) considering every collaboration, (ii) focusing on collaborations between target institutes (based on RQ 1 and RQ 2), and (iii) focusing on collaboration outcomes preferred by the target institutes (based on RQ 1, RQ 2, and RQ 3). The three approaches were evaluated based on research outcomes of POSTECH and RIST from 2011 to 2020. By comparing (i) with the other two approaches, we can validate RQ 1. A comparison of (i) with (ii) can verify RQ 2. Finally, RQ 3 can be validated by comparing (iii) with the others and examining performances of the proposed methods for different types of publications (e.g., papers and patents). Contributions of this study can be categorized as follows:
(i) Modeling and embedding the interinstitutional collaboration network: This study proposes a novel bibliographic model representing interinstitutional collaborations and a model for embedding the proposed network. Finally, we propose three approaches for predicting collaboration probabilities by using the embedding vectors.
(ii) Discovering features of the interinstitutional team formation: The three approaches for team formation are based on individual features. The first one focuses on collaboration styles of each scholar. The second and third approaches consider collaboration styles of both research institutes and scholars and research interests of the institutes, respectively. Thus, experimental results for the approaches can exhibit these features’ significance for the interinstitutional team formation.
(iii) Validating distinctiveness of interinstitutional collaborations: The comparisons between the three approaches also validate the fundamental assumptions of this study. The validation assures that we need specialized methods for composing interinstitutional research teams.
Our findings can also be applied to other bibliography analysis tasks, such as predicting research institutes’ performances and matching employers (institutes) and employees (scholars).

The remainder of this paper is organized as follows. Section 2 introduces the existing studies for the research team formation. In Section 3, we introduce the interinstitutional collaboration network, and we propose methods for embedding the network and for composing interinstitutional research teams. Section 4 explains experimental procedures for evaluating the proposed methods and validates their effectiveness based on the experimental results. Section 5 presents concluding remarks and future research directions.

No existing studies have addressed forming interinstitutional research teams. Although Purwitasari et al. [17] have proposed a team formation method for interdepartmental research collaboration, this method considers only topics of publications and does not consider departments/institutes as one of the stakeholders of research. Hernandez-Gress et al. [18] analyzed bibliographic data to recommend collaborations between universities by using only research topics of each scholar. Additionally, Guerrero-Sosa et al. [19] analyzed internal and external research collaborations of Universidad Autónoma de Yucatán, but their analysis results were limited to statistics of the data. Therefore, this study validates whether research institutes are significant stakeholders of research and proposes team formation methods that can consider the interests of both institutes and scholars.

Beyond interinstitutional research, there have been numerous studies on recommending research collaborators. Most of the existing studies applied link prediction techniques to bibliographic networks. They extracted various features from research publications or bibliographic networks to search for scholars who can potentially (or sustainably [20, 21]) collaborate. Structures of bibliographic networks provide various information for bibliographic entities (e.g., scholars, publications, and venues) [1, 5, 6]. Regarding scholars, coauthorship relations show which types of collaborators are preferred by each scholar [1]. Temporal changes in coauthorship relations also reveal the sustainability of collaborations [6]. By analyzing structures of citation networks, we can extract publications’ scientific impact and topical relevancy between publications [20]. Even without citations, relations between scholars and venues partially represent research topics of scholars [5]. Therefore, various studies [8–10, 12, 20–24] attempted to extract structural features of the bibliographic networks and apply them to predicting future coauthorship. To deal with the structural features, affinity propagation based on random walks was the most popular approach [9, 10, 20, 25]. However, recently, network embedding models have enabled us to represent the structural features by using low-dimensional fixed-length vectors [1, 5, 6, 12, 26]. Owing to the vector representations, we can use conventional machine learning techniques to predict the collaboration probability without much modification.

There have been mainly two kinds of embedding models: proximity-based and structure-based models. If we employ proximity-based models [26], the obtained vector representations will have high similarity for scholars in the same community. However, we can search for collaborator candidates in a circle of acquaintance by ourselves. Also, some scholars prefer collaborators who come from diverse research groups [1]. Therefore, for the practicality of team formation methods, we have to provide unexpected collaborator candidates that are similar to previous collaborators of users. This study employs a structure-based network embedding model and modifies it to apply to the proposed bibliographic network.

Although bibliographic network structures reflect research topics of publications and scholars, they can hardly be as accurate as analyzing the publications’ content. Therefore, various studies applied topic modeling [7] and word/document embedding [9, 26] techniques to textual data in publications with an assumption that scholars who deal with similar research fields can collaborate [7–10, 12, 26, 27]. Obviously, information for research topics is valuable for team formation. If we match two scholars in irrelevant domains, it is difficult for them to collaborate, however talented they are. Nevertheless, this assumption cannot deal with forming interdisciplinary research teams, despite their significance for pioneering new research areas and providing practical experiences to scholars [28, 29]. We can also analyze probabilities of interdisciplinary research by combining the research topic information with bibliographic network structures. However, analyzing academic publications’ content is beyond the scope of this study. Our further research will attempt to cover the combination of the two kinds of information.

Additionally, a few studies used statistical features extracted from bibliographic data. Bibliometrics (e.g., the h-index) are effective for representing the performance of scholars (and other kinds of bibliographic entities) with a single value [21, 27]. However, each of the bibliometrics reflects only fragmentary aspects of research. When one scholar wrote a few high-impact publications, another scholar published numerous intermediate publications, and they have the same h-index, it is difficult to say which scholar has a better performance than the other. Moreover, a few existing studies validated that network embedding models can reflect features represented by the bibliometrics [1, 5, 6]. Also, career ages of scholars were used in several existing studies [10, 21, 27, 30]. Nevertheless, this information is already included in bibliographic networks, and we do not always require collaborators whose career ages are similar to ours.

In summary, the existing methods have mainly two limitations. First, the existing studies suppose that scholars are the only stakeholder of research. However, as discussed in Section 1, research institutes have their own research interests and purposes. Also, scholars are influenced by the interest and purposes, as employees of the institutes. Second, sharing research topics or being active in the same research communities is not always good for research collaborations. To conduct research, which is a cooperative task, we need team members who can serve individual parts. Thus, a method that can consider both scholars’ diverse roles and research institutes’ purposes is required.

3. Interinstitutional Research Collaboration Prediction

This study aims at composing interinstitutional research teams by considering characteristics of both research institutes and their members. We have improved the conventional methods in terms of the three following points:
(i) The proposed bibliographic network model covers information for research institutes and projects.
(ii) Substructure-based graph embedding methods enable us to reveal research interests and expertise of institutes/scholars.
(iii) We propose the three approaches for learning collaboration history of target institutes. The approaches were evaluated and compared with each other in Section 4.

3.1. Interinstitutional Collaboration Network

Most of the existing studies only use coauthorship relations for analyzing/predicting collaborations. However, using solely coauthorship has difficulties for discovering characteristics of scholars and institutes in collaborations, such as research interests, roles in research groups, and expertise. Therefore, we extend the conventional bibliographic network, which consists of scholars, publications, and venues, to cover research institutes and projects. The proposed network model is defined as follows.

Definition 1. (Interinstitutional Collaboration Network). This study defines the bibliographic network as a heterogeneous network, which has multiple kinds of nodes and relations. The bibliographic network contains five kinds of nodes: scholars $S$, publications $P$, venues $V$, institutes $I$, and projects $R$. Between these nodes, there are five kinds of relations: a scholar $s \in S$ “writes” an academic publication $p \in P$, an academic publication $p$ is published in a venue $v \in V$, a scholar $s$ “is affiliated with” an institute $i \in I$, a scholar $s$ can “participate in” a project $r \in R$, and an academic publication $p$ can “be a result of” a project $r$.
This can be formulated as
$$G = (N, E), \qquad N = S \cup P \cup V \cup I \cup R, \qquad E \subseteq (S \times P) \cup (P \times V) \cup (S \times I) \cup (S \times R) \cup (P \times R).$$
Edges in the network represent only the existence of the relations, and the edges connect only heterogeneous nodes, so edge directions need not be annotated. Thus, the interinstitutional collaboration network is undirected and unweighted. Figure 2 illustrates an example of the bibliographic network.
As shown in Figure 1, we can reveal characteristics of research institutes, such as their preferences for research fields, using only publication records. However, this information does not include collaboration styles of the institutes. Also, we assume that scholars choose their collaborators differently according to their own and their collaborators’ affiliations. This point can be revealed by the metapath institute-scholar-publication-scholar-institute, which represents preferences of research institutes for partner institutes. Research interests and aims of the institutes will also be reflected by institute-scholar-publication-venue. Project nodes enable us to know whether joint projects between target institutes have been successful or not (institute-scholar-project-publication-scholar-institute). We can also analyze the sustainability of interinstitutional teams after the joint projects are finished. The sustainable teams will be benchmarks for composing productive research teams.
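The network model and the metapaths discussed above can be illustrated with a minimal sketch. The node naming scheme (a one-letter type prefix: s, p, v, i, r), the helper names, and the toy edges are assumptions of this example, not part of the proposed method's implementation.

```python
from collections import defaultdict

# Assumed convention for this sketch: the first character of a node id is its
# type (s=scholar, p=publication, v=venue, i=institute, r=project).
ALLOWED = {frozenset(t) for t in (("s", "p"), ("p", "v"), ("s", "i"),
                                  ("s", "r"), ("p", "r"))}

def node_type(node):
    return node[0]

def build_network(edges):
    """Return an undirected, unweighted adjacency map; reject other relations."""
    adj = defaultdict(set)
    for a, b in edges:
        if frozenset((node_type(a), node_type(b))) not in ALLOWED:
            raise ValueError(f"unsupported relation: {a} - {b}")
        adj[a].add(b)
        adj[b].add(a)
    return adj

def metapath_instances(adj, types):
    """Enumerate simple paths whose node types follow `types`, e.g. 'ispsi'."""
    paths = [[n] for n in adj if node_type(n) == types[0]]
    for t in types[1:]:
        paths = [p + [n] for p in paths for n in sorted(adj[p[-1]])
                 if node_type(n) == t and n not in p]
    return paths

# Toy data: two scholars from different institutes share a paper and a project.
edges = [("s1", "p1"), ("s2", "p1"), ("p1", "v1"), ("s1", "i1"), ("s2", "i2"),
         ("s1", "r1"), ("s2", "r1"), ("p1", "r1")]
adj = build_network(edges)
partner_paths = metapath_instances(adj, "ispsi")
```

Here `partner_paths` lists the institute-scholar-publication-scholar-institute instances, that is, the pairs of institutes linked by a coauthored publication.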

3.2. Bibliographic Network Embedding

Adjacency-based graph embedding methods (e.g., LINE [31]) can be effective for revealing preferences of scholars and research institutes. If scholars $s_1$ and $s_2$ collaborate frequently and wrote a number of publications with $s_3$, these methods will assign close vector representations to the three scholars. Then, $s_2$ will be one of the collaborator candidates of $s_1$ with high priority. When all the three scholars have similar roles in their collaboration, this recommendation is reasonable. However, scholars with the same expertise will not have much motivation for collaboration. If $s_3$ has been advising $s_1$ and $s_2$ as a domain expert, $s_1$ and $s_2$ will not have much reason to work with each other.

Our previous study [1] showed that substructure-based graph embedding methods can resolve this issue. These methods assign similar vector representations to nodes that have similar substructures. In the above example, if $s_3$ prefers applying his/her own expertise to various domains, substructures rooted in $s_3$ will have the star topology. The various domains will also be revealed by the diversity of scholars and venues connected with $s_3$. In contrast, $s_1$ and $s_2$ will be connected with less diverse venues than $s_3$.

This point is the same for discovering characteristics of research institutes and projects. Universities will have connections with more various venues than nonuniversity research institutes, which mostly have particular research fields. Also, participants of pure research projects will be members of universities rather than of companies. On the other hand, both universities and companies will participate in projects for technology commercialization.

Therefore, this study applies Subgraph2Vec [13], which aims at embedding subgraphs rooted in each node, on the bibliographic network. Subgraph2Vec consists of WL (Weisfeiler-Lehman) relabeling process [32] and Word2Vec [33]. This model assigns close vectors on subgraphs rooted in the same (or adjacent) nodes.

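First, WL relabeling is a method for describing subgraphs rooted in each node exactly. This method iteratively assigns a new label to each node by using the labels of the node itself and of its adjacent nodes. For example, a scholar node in Figure 2 has $S$, which is its node type, as an initial label. At the first iteration, we check the labels of its neighborhoods, for example, $P$ of a publication, $I$ of an institute, and $R$ of a project. Then, the scholar gets a new label composed of its own label and the labels of its neighborhoods. By iterating this process, the scales of subgraphs represented by the labels become wider. To observe network structures with multiple scales, we call labels generated at the $d$-th iteration “subgraphs on degree $d$” and describe substructures rooted in a node as a set of the subgraphs. In practice, we sort the labels of neighborhoods and apply the hash function on the new label to avoid making redundant labels. Algorithm 1 presents procedures of the WL relabeling on our bibliographic network model, where $sg_d^{n}$ indicates the subgraph rooted in a node $n$ on degree $d$, $SG$ denotes a subgraph dictionary, and $D$ refers to the maximum degree. In the algorithm, $N_X(n)$ denotes the neighbors of $n$ whose node type is $X$, and $\oplus$ denotes label concatenation.

Algorithm 1: WL relabeling on the bibliographic network.
procedure WLRELABELING(G, D)
    for d = 1 to D do
        for each scholar s in S do
            l ← sg_{d-1}^{s} ⊕ sort({sg_{d-1}^{p} : p in N_P(s)})
                    ⊕ sort({sg_{d-1}^{i} : i in N_I(s)}) ⊕ sort({sg_{d-1}^{r} : r in N_R(s)})
            sg_d^{s} ← hash(l)
            Put sg_d^{s} into SG
        for each publication p in P do
            l ← sg_{d-1}^{p} ⊕ sort({sg_{d-1}^{s} : s in N_S(p)})
                    ⊕ sort({sg_{d-1}^{v} : v in N_V(p)}) ⊕ sort({sg_{d-1}^{r} : r in N_R(p)})
            sg_d^{p} ← hash(l)
            Put sg_d^{p} into SG
        for each venue v in V do
            l ← sg_{d-1}^{v} ⊕ sort({sg_{d-1}^{p} : p in N_P(v)})
            sg_d^{v} ← hash(l)
            Put sg_d^{v} into SG
        for each institute i in I do
            l ← sg_{d-1}^{i} ⊕ sort({sg_{d-1}^{s} : s in N_S(i)})
            sg_d^{i} ← hash(l)
            Put sg_d^{i} into SG
        for each project r in R do
            l ← sg_{d-1}^{r} ⊕ sort({sg_{d-1}^{s} : s in N_S(r)}) ⊕ sort({sg_{d-1}^{p} : p in N_P(r)})
            sg_d^{r} ← hash(l)
            Put sg_d^{r} into SG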
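The relabeling procedure can be sketched compactly in Python. This is an illustrative simplification under stated assumptions: it treats all neighbor types uniformly rather than handling each node type in its own loop, and the function name, the `|` and `,` separators, and the truncated SHA-1 hash are choices made for the example only.

```python
import hashlib

def wl_relabel(adj, node_type, max_degree):
    """Return ({node: [label_0, ..., label_D]}, set of all labels seen).

    adj        -- dict mapping each node to the set of its adjacent nodes
    node_type  -- function giving the initial (degree-0) label of a node
    max_degree -- number of relabeling iterations D
    """
    labels = {n: [node_type(n)] for n in adj}       # degree 0: the node type
    dictionary = {l[0] for l in labels.values()}    # subgraph dictionary
    for d in range(1, max_degree + 1):
        for n in adj:
            # Sort neighbor labels so the result ignores neighbor ordering.
            neigh = sorted(labels[m][d - 1] for m in adj[n])
            raw = labels[n][d - 1] + "|" + ",".join(neigh)
            # Hash to keep labels short; equal structures get equal labels.
            new = hashlib.sha1(raw.encode()).hexdigest()[:12]
            labels[n].append(new)
            dictionary.add(new)
    return labels, dictionary

# Toy star: one scholar with two publications that are structurally identical.
toy = {"s1": {"p1", "p2"}, "p1": {"s1"}, "p2": {"s1"}}
toy_labels, toy_dict = wl_relabel(toy, lambda n: n[0], max_degree=2)
```

In the toy star, `p1` and `p2` receive identical labels at every degree, which is the property the embedding step relies on: nodes with the same rooted substructure share subgraph labels.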

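To apply Word2Vec on subgraphs, we have to define ranges of their neighborhoods. In texts, sentences are sequences of words, and neighboring words can easily be extracted using sliding windows. However, nodes in networks are not sequential. Therefore, we define neighborhoods based on adjacency of nodes and degrees, as in the previous study [1]. Neighborhoods of the subgraph $sg_d^{s}$ rooted in a scholar $s$ on degree $d$ can be formulated as
$$\mathcal{N}(sg_d^{s}) = \{\, sg_{d'}^{n} \mid n \in \{s\} \cup \mathrm{Adj}(s),\ |d - d'| \le w,\ (n, d') \ne (s, d) \,\},$$
where $\mathrm{Adj}(s)$ is the set of nodes adjacent to $s$ and $w$ is a window size for the degree. Neighborhoods for the other node types are composed in the same way.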

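To embed the subgraphs, we use SkipGram with negative sampling [33]. For a subgraph $sg$ and one of its neighborhoods $sg' \in \mathcal{N}(sg)$, the objective to be maximized can be formulated as
$$\log \sigma\big(f(sg')^{\top} f(sg)\big) + \sum_{j=1}^{k} \mathbb{E}_{sg_j \sim P_n}\big[\log \sigma\big(-f(sg_j)^{\top} f(sg)\big)\big],$$
where $\sigma$ is the sigmoid function, $P_n$ denotes a noise distribution of subgraphs, $k$ indicates the number of negative samples, and $f(\cdot)$ denotes the projection function. In this study, $P_n$ is a uniform distribution. The vector representation of a node is obtained by concatenating the vectors of the subgraphs rooted in the node, from the lowest degree to the highest.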
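As a rough illustration of how training pairs and the negative-sampling loss fit together, the sketch below builds (target, context) pairs from relabeled nodes and evaluates the loss for a single pair. The function names, the use of separate target and context vector tables, and the inclusion of same-node subgraphs of other degrees in the context are assumptions of this example.

```python
import math

def context_pairs(adj, labels, window):
    """Pair each subgraph with subgraphs of nearby degree rooted in the
    same node or in adjacent nodes (degree distance at most `window`)."""
    pairs = []
    for n in adj:
        for d, target in enumerate(labels[n]):
            for m in [n] + sorted(adj[n]):
                lo = max(0, d - window)
                hi = min(len(labels[m]) - 1, d + window)
                for d2 in range(lo, hi + 1):
                    if (m, d2) != (n, d):          # skip the subgraph itself
                        pairs.append((target, labels[m][d2]))
    return pairs

def sgns_loss(vec, ctx, target, context, negatives):
    """Negative-sampling loss (to minimize) for one (target, context) pair.

    vec, ctx  -- dicts mapping a subgraph label to a list of floats
    negatives -- labels drawn from the noise distribution (uniform here)
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    log_sigmoid = lambda z: -math.log1p(math.exp(-z))
    loss = -log_sigmoid(dot(vec[target], ctx[context]))
    for neg in negatives:
        loss -= log_sigmoid(-dot(vec[target], ctx[neg]))
    return loss

toy_adj = {"a": {"b"}, "b": {"a"}}
toy_sg = {"a": ["A0", "A1"], "b": ["B0", "B1"]}
pairs_w0 = context_pairs(toy_adj, toy_sg, window=0)
pairs_w1 = context_pairs(toy_adj, toy_sg, window=1)
```

With a window of 0 each subgraph is paired only with same-degree subgraphs of adjacent nodes; widening the window adds subgraphs of neighboring degrees. The loss function is written for readability and does not guard against overflow for very large negative scores.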

3.3. Research Collaboration Prediction

We use the conventional MLP (Multilayer Perceptron) model to predict interinstitutional collaborations. The MLP model consists of three fully connected layers and one dropout layer. Inputs of the model are vectors composed by concatenating the vector representations of two scholars. The activation function of the output layer is the sigmoid function, and the other layers use the ReLU (Rectified Linear Unit) function as their activation functions. This model predicts the collaboration probability between two scholars, and scholar pairs are classified into two groups: those that are appropriate for collaboration and those that are not. As a loss function, the binary cross entropy is applied.
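A minimal forward pass for such a model can be sketched in plain Python. The layer sizes in the example, the weight initialization, the position of the dropout layer (after the first hidden layer), and the use of a separate single-unit output layer are assumptions made for illustration; the source does not specify them.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def dense(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def predict(layers, u, v, drop_rate=0.0, rng=None):
    """Collaboration probability for scholar vectors u and v."""
    h = list(u) + list(v)                       # concatenate the two scholars
    for k, (W, b) in enumerate(layers):
        z = dense(h, W, b)
        if k == len(layers) - 1:
            return 1.0 / (1.0 + math.exp(-z[0]))   # sigmoid output unit
        h = [max(0.0, a) for a in z]               # ReLU on hidden layers
        if k == 0 and drop_rate > 0.0 and rng is not None:
            # Inverted dropout, used only while training (assumed position).
            h = [0.0 if rng.random() < drop_rate else a / (1.0 - drop_rate)
                 for a in h]

def bce(p, y, eps=1e-12):
    """Binary cross entropy for one prediction p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

rng = random.Random(0)
sizes = [8, 6, 4, 3, 1]      # toy sizes: 2 x 4-dim scholar vectors, one output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
prob = predict(layers, [0.1] * 4, [0.2] * 4)
```

Training would update the weights by gradient descent on `bce`; that part is omitted here to keep the sketch short.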

In this study, we focus on the interinstitutional collaborations that should consider not only relationships between individual scholars but also relationships between research institutes and between scholars and institutes. Research institutes have their own purposes, and members of the institutes also should concentrate on occupational research. Therefore, we cannot ensure that training the model to predict every collaboration (scholar-publication-scholar relations) in the bibliographic network is the best approach for learning the individual characteristics of research institutes. Therefore, we propose two more approaches based on our research questions (in Section 1) to make the model reflect agendas of the target institutes and compare them with the conventional approach (i.e., learning all the previous collaborations). The three approaches for training the MLP model are as follows:
(i) Case 1: Learning all the collaboration relations in the bibliographic network.
(ii) Case 2: Learning previous collaborations between the target institutes (RQ 1 and RQ 2).
(iii) Case 3: Learning collaborations that produced similar publications to previous collaborations between the target institutes (RQ 1, RQ 2, and RQ 3).

The first case supposes that the bibliographic network embedding method can represent characteristics of research institutes and their collaborations despite their diversity. Thus, this case assumes that scholars’ vector representations include information for purposes and preferences of the scholars’ affiliations. In this case, the MLP learns all the collaborations in the bibliographic network, as shown in Figure 3(b), and we use the trained model to predict probabilities of further collaborations between scholars from target institutes. Therefore, this approach makes the prediction model reflect the general characteristics of research collaborations. Although the general characteristics cover interinstitutional collaborations, this will not be as clear as focusing on only interinstitutional collaborations. Thus, we use this approach as a baseline for validating whether interinstitutional collaborations have distinctive characteristics compared to the others (RQ 1).

The second case, which is based on RQ 1 and RQ 2, focuses on searching for scholars that are appropriate for collaborations between the target institutes. There will be scholars who prefer collaborations but only intrainstitutional collaborations or only particular partner institutes. If a scholar has preferences according to reputations or types of institutes, our embedding model can extract the information from publications and venues connected with the institutes. The institutes will also concern whether the scholar can conduct research that they expect. For example, POSTECH and RIST are significant research partners of each other. However, not all the scholars in the two institutes participated in collaborative studies between the institutes. Thus, we can assume that there will be a certain type of scholars that are appropriate for mutual interests of the institutes. Therefore, this case uses bibliographic networks that consist of scholars in the target institutes as a dataset. Then, we train the MLP to predict whether a group of scholars from the respective institutes has previous collaborations, as shown in Figure 3(c). By comparing this case with the first one, we can reveal whether research institutes’ characteristics affect their employees (RQ 2).

We have designed the third approach based on all the research questions (RQ 1, RQ 2, and RQ 3). This case especially concentrates on the fact that research institutes have individual agendas and preferable kinds of publications (RQ 3). Thus, we first find academic publications that are similar to outcomes of previous collaborations between the target institutes by clustering publications in our bibliographic network according to their vector representations. Then, we search for scholars who have written publications that are in the same clusters with the previous collaboration outcomes. We assume that scholars are capable of conducting research that the target institutes expect from their collaborations. When publications that come from collaborations between POSTECH and RIST are in cluster A, research groups that wrote publications in cluster A will let us know compositions of research groups that are appropriate for collaborations between the two institutes. Thus, in this approach, the MLP model learns only the research groups which produced research outcomes that are similar to the previous collaboration outcomes of the target institutes, as shown in Figure 3(d). By comparing this approach with the others, we can validate whether research institutes have preferences for types or topics of publications (RQ 3).
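The selection step of the third approach can be sketched as follows, assuming that publications have already been clustered by their vector representations. The clustering algorithm is not specified here, so the sketch takes cluster assignments as input; the function and argument names are hypothetical.

```python
def case3_training_groups(pub_cluster, pub_authors, target_pubs):
    """Author groups of publications that share a cluster with the outcomes
    of previous collaborations between the target institutes.

    pub_cluster -- dict: publication id -> cluster id
    pub_authors -- dict: publication id -> set of scholar ids
    target_pubs -- publications produced by previous target collaborations
    """
    target_clusters = {pub_cluster[p] for p in target_pubs}
    return {p: sorted(authors) for p, authors in pub_authors.items()
            if pub_cluster[p] in target_clusters and len(authors) > 1}

pub_cluster = {"p1": 0, "p2": 0, "p3": 1}
pub_authors = {"p1": {"a", "b"}, "p2": {"c", "d"}, "p3": {"e", "f"}}
groups = case3_training_groups(pub_cluster, pub_authors, target_pubs={"p1"})
```

In this toy example, `p2` is kept because it falls in the same cluster as the previous collaboration outcome `p1`, while `p3` is excluded; the retained author groups are what the MLP would learn from in this case.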

4. Evaluation

To evaluate the proposed methods, we predicted interinstitutional collaborations by analyzing previous collaboration history. Also, our research questions were validated by comparing the performances of the proposed methods with each other. We supposed that research institutes have preferences for topics and types of their members’ research outcomes (RQ 2 and RQ 3). Thus, we should collect multiple types of academic publications, although the existing studies mostly dealt with one type. The multiple types caused a limitation in our experiments. Unlike papers with numerous well-organized academic databases (e.g., DBLP and Scopus), it is not easy to expect accurate publication records for patents or technical reports published by each research institute. Thus, we collected the paper dataset from the open academic databases and acquired a patent dataset by directly requesting it to research institutes. Due to this point, we could not conduct the experiments on a large-scale dataset for multiple research institutes. Nevertheless, publication records of research institutes include their collaborating institutes. Thus, the proposed methods made answers by analyzing hundreds of research institutes’ characteristics, although they predict collaborations between a few institutes.

We collected papers and patents published by scholars in POSTECH and RIST from January 2011 to September 2020. The papers were gathered through the affiliation profile pages on Scopus, and RIST provided bibliographic data for the patents. Our bibliographic network consists of the papers, patents, and every scholar/institute/venue connected with the papers and patents. We composed the network for two time periods: 2011–2015 and 2016–2020. The proposed methods were trained on the bibliographic data from 2011 to 2015 and validated based on the collaborations from 2016 to 2020. In our dataset, papers’ author names are in English, and patents’ inventor names are in Korean. Thus, we could not build a unified network for both types of publications. We constructed two separate networks and compared the performances of the proposed approaches on the two networks to validate whether research institutes (and their members) have distinct characteristics. Table 1 presents statistics of the bibliographic networks.

The three approaches proposed in Section 3.3 were evaluated based on accuracy for predicting collaboration outcomes between POSTECH and RIST. The accuracy was assessed using three metrics: precision, recall, and $F_1$ measure. When we measure the accuracy of predicting collaborations between institutes $i_a$ and $i_b$, these metrics are calculated as
$$P(i_a, i_b) = \frac{|\hat{C}(i_a, i_b) \cap C(i_a, i_b)|}{|\hat{C}(i_a, i_b)|}, \qquad R(i_a, i_b) = \frac{|\hat{C}(i_a, i_b) \cap C(i_a, i_b)|}{|C(i_a, i_b)|}, \qquad F_1(i_a, i_b) = \frac{2\, P(i_a, i_b)\, R(i_a, i_b)}{P(i_a, i_b) + R(i_a, i_b)},$$
where $\hat{C}(i_a, i_b)$ and $C(i_a, i_b)$ are sets of predicted and actual collaborations between two institutes, respectively, and $P(i_a, i_b)$, $R(i_a, i_b)$, and $F_1(i_a, i_b)$ indicate precision, recall, and $F_1$ measure for predicting collaborations between two institutes, respectively. We compared the performances of the proposed approaches with the performance of a baseline method and also with each other. As the baseline, we use the model of Case 1, one of the proposed approaches, to predict all the collaborations (referred to as Case All). A comparison of this case with the proposed approaches exhibits the necessity of methods specialized in predicting interinstitutional collaborations. Table 2 presents experimental results.
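The three metrics can be computed directly on sets of scholar pairs, as in the short sketch below. Representing a collaboration as an unordered pair (a `frozenset` of two scholar ids) and the scholar names used are assumptions of this example.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall, and F1 for predicted vs. actual collaboration sets."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

pair = lambda a, b: frozenset((a, b))   # unordered scholar pair
predicted = {pair("a", "x"), pair("b", "y"), pair("c", "z")}
actual = {pair("x", "a"), pair("b", "y"), pair("d", "w"), pair("e", "u")}
p, r, f1 = precision_recall_f1(predicted, actual)
```

Two of the three predicted pairs are actual collaborations, so precision is 2/3, recall is 2/4, and the $F_1$ measure is 4/7.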

Additionally, we heuristically tuned the hyperparameters of the proposed methods. The number of dimensions for subgraph vectors was 100, and the maximum degree was 4. The MLP model for predicting collaborations includes three fully connected layers that have 200, 150, and 80 nodes. The dropout rate of its dropout layer was 0.2. Also, the number of epochs and the learning rate were set to 50 and 0.0008, respectively.

4.1. RQ 1: Distinct Characteristics of Interinstitutional Research Collaboration

The motivation of this study is that we need a collaboration prediction method specialized in interinstitutional collaborations. The necessity can be validated by comparing the performance of Case 1 with the performance of Case All. These two cases present accuracies of the same model for different targets. Case 1 shows accuracy for predicting interinstitutional collaborations, while Case All is for every collaboration. Therefore, the result that Case All had higher accuracy than Case 1 also underpins RQ 1.

The drop in performance from Case All to Case 1 was similar for collaborated papers and patents. However, both Case 1 and Case All achieved higher accuracy on papers than on patents. In contrast, Case 2 and Case 3, which focus on the previous collaborations, achieved higher accuracy on patents than on papers. We can assume that patents are influenced more strongly by the characteristics of interinstitutional collaborations than papers are, although we should also consider that scholars in RIST barely write papers (62 papers from 2011 to 2020). To examine this point, we should repeat the experiment with a larger dataset containing more institutes and publication types in further research.

Differences in the diversity of publication types may also contribute to this result, since papers are more diverse than patents. For collaborated patents between POSTECH and RIST, the precision and recall of Case 1 were similar to each other. For papers coauthored by the two institutes, however, its precision was much higher than its recall. This imbalance was worse in Case 2, which learns only from the previously collaborated papers. In contrast, Case All and Case 3 showed small gaps between precision and recall on both patents and papers, so these two cases might be less affected by the diversity of publications. Since Case All learned general characteristics of research collaboration, it gained the capability to handle this diversity. Case 3, on the other hand, searched for scholars who can produce the same kinds of research outcomes as the previous collaborations between POSTECH and RIST. Thus, it learned both the diversity and the two institutes’ characteristics. In conclusion, both types of research publications were affected by the distinct characteristics of interinstitutional research. However, due to the diversity of papers, we need more samples and better methods for extracting features of papers produced by interinstitutional teams.

4.2. RQ 2: Correlations of Research Collaborations with Affiliations

Case 2 learns only interinstitutional collaborations between target institutes, while Case 1 is based on all the collaborations. Also, Case 2 emphasizes scholars who participated in the collaborations, whereas Case 3 focuses on publications. Case 2 could not outperform Case 1 in predicting papers coauthored by POSTECH and RIST. However, this result might come from RIST’s lack of interest in writing papers; Section 4.3 discusses this point in detail. In contrast, Case 2 exhibited the best performance in predicting collaborated patents of the two institutes: its accuracy nearly caught up with that of Case All (prediction for every collaboration). Additionally, Case 2 exhibited a reasonable precision (0.77) for predicting the collaborated papers, despite its low recall. This suggests that Case 2 could extract characteristics of previous collaborations, but that these characteristics did not generalize well due to the lack of samples. These results support the view that focusing on previous collaborations between the institutes is more effective than learning the general characteristics of research collaboration, provided that enough samples are available.

RQ 2 is the assumption that characteristics of research institutes influence the collaborations of their members. By comparing Case 1 with Case All, we found that interinstitutional collaborations between particular institutes have unique characteristics compared to the other collaborations conducted by the institutes. Case 2 then revealed that we can find and utilize these unique characteristics. Also, the differences in accuracy between papers and patents might be caused by the fact that scholars in RIST barely write papers and that we did not restrict our dataset to research conducted on duty. In other words, research institutes have preferences for certain types and topics of research outcomes, and these preferences affect the research of scholars in the institutes. In conclusion, the characteristics of scholars’ affiliations affect their research and research collaborations.

4.3. RQ 3: Preferences of Research Institutes for Collaboration Outcomes

Case 3 aims at finding the kinds of publications preferred by the research institutes and forming research groups that are capable of producing the same kinds of research outcomes. Case 3 could not outperform Case 2, which concentrates on previous participants in collaborations between the institutes. However, the performance gaps between them were not significant, and Case 3 had a higher recall than Case 2 for predicting collaborated papers. These results suggest that the styles of expected publications are as significant as the characteristics of scholars for predicting interinstitutional collaborations. Also, the effectiveness of expected publications means that research institutes have preferences for certain kinds of research outcomes, although we do not know what exactly determines the “kinds.”

We can also see this point by comparing the accuracy for collaborated papers with that for collaborated patents. Case 2 and Case 3 outperformed Case 1 in predicting collaborations that produced patents, while they showed the opposite result in predicting collaborated papers. Case 1 had an advantage in learning from more diverse scholars, publications, and venues, while Case 2 and Case 3 restricted the ranges of training data. Thus, we can assume that papers are more diverse than patents. Case 1 and Case 2 had lower recall than precision in predicting collaborated papers. For Case 2, the collaborated papers from 2011 to 2015 might not be enough to represent collaborations between POSTECH and RIST. However, the different results of Case 1 and Case All are difficult to explain. We cautiously conjecture that their collaborations on papers changed between the two time periods (2011 to 2015 and 2016 to 2020). To understand these results more clearly, we should conduct experiments with more institutes and publication types in further research. Differences in the purposes of POSTECH and RIST could worsen the issue. We interviewed staff of the technology licensing office of RIST to find the differences. According to the staff, RIST concentrates on applying for patents on its research outcomes and restricts its members’ academic papers. In contrast, POSTECH is a research-oriented university, and it barely intervenes in the dissemination of research outcomes. Thus, the paper dataset was not enough to represent scholars in RIST; only 62 papers were written by members of RIST during the recent ten years, while 2,862 and 2,762 patent applications were published by RIST and POSTECH, respectively, during the same period.

In conclusion, institutes expected particular styles of publications from their collaborations, and this expectation was effective for interinstitutional research team formation. Also, there were significant differences between types of publications (e.g., journal articles, patents, and books), and research institutes occasionally had preferences for publication types. However, we should construct a unified bibliographic network that includes more research institutes and various academic publications to validate RQ 3 more clearly.

5. Conclusion

This study addresses interinstitutional research team formation. The existing methods for composing research teams barely considered the characteristics of research institutes, although the institutes have individual research interests and aims. We have proposed methods for extracting features of both research institutes and scholars and methods for composing interinstitutional teams based on both sides’ characteristics. First, we extended the conventional bibliographic network to represent research institutes’ characteristics and embedded the network. Based on vector representations of scholars and publications, we have proposed three methods for predicting collaboration probabilities between scholars in target institutes. The three methods have different ranges of training data: (i) all the previous collaborations, (ii) collaborations between target institutes, and (iii) publications preferred by the target institutes.

We evaluated the three prediction methods and validated our assumptions by learning the collaborations between POSTECH and RIST from 2011 to 2015 and predicting their collaborations from 2016 to 2020. From the experimental results, we found that interinstitutional research collaborations have distinct characteristics compared to other types of collaborations. Also, as we expected, publications of scholars were affected by their affiliations, and this influence was correlated with the collaborations of the scholars. Lastly, some institutes had preferences for particular types of publications. These correlations and preferences were helpful for predicting future collaborations.

Despite the reasonable accuracy of the proposed methods, they have also shown several limitations as follows:

(i) Scale of dataset: We conducted experiments for only two institutes, and we could not integrate the bibliographic data for papers and patents due to the author name disambiguation problem. We should construct a unified bibliographic network that includes more research institutes and various academic publications to validate RQ 3 more clearly. Also, our experiment for predicting collaborated papers could not be generalized enough due to the lack of Scopus-indexed papers published by RIST. Thus, we should diversify our data sources, for example, by collecting domestic journals and conferences. Considering more research institutes can also alleviate this problem.

(ii) Collaboration prediction methods: To predict interinstitutional collaborations, we simply used the conventional MLP model. Our assumptions were applied only to adjusting the ranges of training and testing data. Although this approach achieved reasonable accuracy and was enough to validate the assumptions, the accuracy can be improved by employing more sophisticated team formation methods. Also, we will attempt to incorporate the assumptions into the prediction models themselves.

(iii) Content of academic publications: We supposed that publications’ venues and authors imply the publications’ content. However, this approach cannot be as accurate as analyzing the content directly. Also, in the case of patents, the venues are the patent offices of each nation. Thus, their venues can be correlated with their impact but not with research domains. The same holds for technical reports and preprints. In further research, we will attempt to combine content analysis for academic publications with the proposed team formation methods.

Data Availability

The bibliographic data used to support the findings of this study were supplied by RIST (Research Institute of Industrial Science and Technology) under license and so cannot be made freely available. Requests for access to these data should be made to RIST (http://www.rist.re.kr).

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ICT Consilience Creative program (IITP-2019-2011-1-00783) supervised by the IITP (Institute for Information and communications Technology Planning and Evaluation). Also, this study was supported by the RIST (Research Institute of Industrial Science and Technology), Korea, under Grant no. 2020K011.