Recurrent-DC: A deep representation clustering model for university profiling based on academic graph

https://doi.org/10.1016/j.future.2020.10.019

Highlights

  • We propose a deep representation clustering model based on academic graph.

  • We replace the traditional excellence index with a quantified complexity index.

  • We find a positive relationship between research production and complexity.

  • Experimental results show that the model achieves state-of-the-art performance.

Abstract

Universities play an important role in exploring new concepts and in knowledge transfer. University research naturally forms heterogeneous graphs through real-life academic communication activities. In recent years, many large scholarly graph datasets containing web-scale numbers of nodes and edges have become available. However, research that characterizes university output from these graph data has so far focused on counting the volume or evaluating the excellence of research articles and then providing a ranking. This paper proposes a novel University Profiling Framework (UPF) that takes the point of view of production and complexity, which differs from these straightforward solutions. The framework includes a novel Recurrent Deep Clustering Model (Recurrent-DC) for learning deep representations and clusters. In our model, successive operations in a clustering algorithm are expressed as steps in a recurrent process, stacked on top of representations output by a Stacked Autoencoder (SAE). The key idea behind this model is that good representations for the task-specific problem of university clustering can be learned over multiple timesteps. Experimental results illustrate the stability and effectiveness of the proposed model compared with other deep clustering and classical clustering methods.

Introduction

Universities play an important role in exploring new concepts and innovating through research, in addition to transferring knowledge through higher education. University academic research communications can be naturally modeled as heterogeneous graphs. Heterogeneous graphs are commonly used for abstracting and modeling complex systems in which objects of different types interact with each other in various ways. Academic graphs have received a lot of attention in recent years as an important example of heterogeneous graphs. For example, the Microsoft Academic Graph (MAG) [1] contains six types of entities: field of study, author, institution (affiliation of author), paper, venue (journal and conference series, e.g. WWW, SIGIR, KDD) and event (conference instances). Different types of relationships between entities are also included. These entity relationships are rather intuitive. For instance, the fact that papers get published in journals or conferences justifies the edge between paper and venue nodes in the graph.
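To make the notion of a heterogeneous academic graph concrete, the following is a minimal sketch of a typed graph that uses the six entity types listed above. The class name `HeteroGraph`, the relation names (`published_in`, `wrote`, `affiliated_with`) and the node identifiers are illustrative assumptions and do not reflect the actual MAG schema.

```python
from collections import defaultdict

# Entity types named in the text for the Microsoft Academic Graph.
NODE_TYPES = {"field_of_study", "author", "institution", "paper", "venue", "event"}


class HeteroGraph:
    """Tiny typed graph: every node has a type, every edge has a relation name."""

    def __init__(self):
        self.node_type = {}             # node id -> entity type
        self.edges = defaultdict(set)   # (source id, relation) -> set of target ids

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.node_type[node_id] = node_type

    def add_edge(self, src, relation, dst):
        if src not in self.node_type or dst not in self.node_type:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[(src, relation)].add(dst)

    def neighbors(self, src, relation):
        # .get avoids creating empty entries for unknown (source, relation) pairs.
        return sorted(self.edges.get((src, relation), set()))


# Hypothetical toy data: one paper, its venue, one author and an affiliation.
g = HeteroGraph()
g.add_node("p1", "paper")
g.add_node("v_kdd", "venue")
g.add_node("a1", "author")
g.add_node("u1", "institution")
g.add_edge("p1", "published_in", "v_kdd")   # paper -> venue edge from the text
g.add_edge("a1", "wrote", "p1")
g.add_edge("a1", "affiliated_with", "u1")
```

Keying the adjacency map by (source, relation) keeps edges of different relation types apart, which is the property that distinguishes a heterogeneous graph from a plain one.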

Various scientific algorithms have been developed to quantify and assess academic institutes and provide rankings [2], [3], [4], using proprietary or publicly accessible publication records of the academic graph. However, ranking universities is a challenging task because each institution has its own particular mission. Each institution has its own focus and offers different academic programs. Institutions can also differ in size and have varying amounts of resources at their disposal. In addition, each country has its own history and higher education system, which can affect the structure of its colleges and universities and how they compare to others. Universities usually consist of complex research disciplines in different faculties or divisions and conduct different research in different fields, so analyzing a university as a whole is difficult. The main contribution of this article is the use of complexity-based institutional evaluation indicators and the development of corresponding deep clustering algorithms. The traditional method is to count the output or the number of outstanding works: for example, the number of papers published by an institution in a year, or the number of papers published by an institution in top journals in a year. In this paper, evaluating excellence is no longer a simple matter of counting outstanding works. We use a complexity-based indicator to replace the number of outstanding works. The complexity of the papers published by an institution is related not only to the degree of excellence of the papers, but also to the proportion of papers published. Instead of ranking universities, more and more investigators choose to calculate various scientific indicators first and then cluster universities [5]. There is much research on different scientific indicators [6], [7], but few studies investigate clustering algorithms specific to the university profiling task.
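As an illustration of the traditional counting indicators described above (volume per institution per year, and papers in top venues per year), the following minimal sketch computes both from a list of publication records. The records, the institution names and the set `TOP_VENUES` are hypothetical and are not taken from MAG or from the paper's dataset.

```python
from collections import Counter

# Hypothetical publication records: (institution, year, venue). Not real data.
papers = [
    ("Univ A", 2018, "KDD"),
    ("Univ A", 2018, "Workshop X"),
    ("Univ A", 2017, "WWW"),
    ("Univ B", 2018, "SIGIR"),
    ("Univ B", 2018, "Workshop Y"),
]

# Assumed list of "top" venues, chosen only for this example.
TOP_VENUES = {"KDD", "WWW", "SIGIR"}


def count_volume(records, year):
    """Number of papers per institution in the given year (volume indicator)."""
    return dict(Counter(inst for inst, y, _ in records if y == year))


def count_top_venue(records, year, top_venues):
    """Number of papers per institution in top venues in the given year."""
    return dict(
        Counter(inst for inst, y, venue in records if y == year and venue in top_venues)
    )


volume_2018 = count_volume(papers, 2018)
excellence_2018 = count_top_venue(papers, 2018, TOP_VENUES)
```

Both indicators are plain counts, so two institutions with very different publication profiles can receive the same score. This is the limitation that motivates a complexity-based indicator.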

Clustering is one of the most fundamental tasks in data mining and machine learning, with an endless list of applications. It is also a notoriously hard task, whose outcome is affected by a number of factors, including data acquisition and representation, preprocessing, and the clustering criterion. Since its introduction in 1957 by Lloyd (published much later in 1982) [8], K-means has been extensively used either alone or together with suitable preprocessing, due to its simplicity and effectiveness. K-means is suitable for clustering data samples that are evenly spread around some centroids [9]. Many real-life datasets do not exhibit this specific structure, and many scientific indicators represent special data features that usually do not exhibit it either. This task-specific issue limits the performance of classic clustering algorithms.
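For reference, the two alternating steps of Lloyd's K-means (assign each sample to its nearest centroid, then move each centroid to the mean of its members) can be sketched as follows. This is a plain standard-library illustration on hypothetical toy points, not the implementation used in the experiments; the function names and the seeding strategy are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialise with k distinct samples
    labels = None
    for _ in range(max_iterations):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:        # assignments stable, so converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:                 # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical toy data: two compact, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

On data like this, where samples are spread evenly around two centres, the algorithm recovers the groups. The difficulty discussed above arises when the indicator data do not have this structure.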

In recent years, motivated by the success of deep neural networks (DNNs) in supervised learning, unsupervised deep learning approaches have become widely used for representation learning prior to clustering. For example, the Stacked Autoencoder (SAE) [10] makes use of DNNs to learn nonlinear mappings from the data domain to low-dimensional latent spaces. These approaches [11] treat their deep neural networks as a preprocessing stage that is designed separately from the subsequent clustering stage. The hope is that the latent representations of the data learned by these deep neural networks will be naturally suitable for clustering. However, since no task-specific objective is explicitly incorporated in the learning process, the learned deep neural networks do not necessarily output data that are suitable for clustering. Other approaches [9], [12] attempt to optimize the deep neural network and the clustering part jointly to obtain a better result. They aim to use the full power of the stochastic gradient descent algorithm not only in optimizing the deep neural network parameters but also in the clustering assignment. However, for the university profiling clustering task, without explicit learning normalization, joint optimization will not necessarily output results that are suitable for university clustering, as will be seen in our experiments.

In this paper, we propose a task-dependent model that alternates between two steps recurrently: updating the cluster assignment given the current representation parameters, and updating the representation parameters given the current clustering result. To the best of our knowledge, no prior effort has been made to address the scientific features for university clustering by exploiting deep representation learning. Specifically, we cluster data representations using K-means clustering and represent observable data via the activations of an SAE. We regard the university profiling problem as a clustering problem. The production and complexity of each university are modeled as joint vectors. We design an efficient algorithm to optimize the vector representation in the joint space and to conduct clustering. We conduct various experiments to evaluate the effectiveness of our model. Results show that this method outperforms the compared models.
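The alternation between the two steps can be illustrated with a deliberately small sketch. It is not the Recurrent-DC model itself: a single linear encoder and decoder stand in for the SAE, the latent space is one-dimensional, and the loss (squared reconstruction error plus `lam` times the squared distance of each latent code to its assigned centroid) is an assumed objective chosen only to show the control flow. All function names, the toy data and the hyperparameters are hypothetical.

```python
def encode(w, x):
    """Linear encoder: 1-D latent code z = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def assign(zs, centroids):
    """Step 1a: nearest-centroid assignment in the latent space."""
    return [min(range(len(centroids)), key=lambda j: (z - centroids[j]) ** 2)
            for z in zs]


def update_centroids(zs, labels, centroids):
    """Step 1b: each centroid becomes the mean latent code of its members."""
    new = list(centroids)
    for j in range(len(centroids)):
        members = [z for z, label in zip(zs, labels) if label == j]
        if members:
            new[j] = sum(members) / len(members)
    return new


def representation_step(data, labels, centroids, w, v, lam=0.1, lr=0.05):
    """Step 2: one gradient step on encoder w and decoder v, clusters fixed."""
    grad_w = [0.0] * len(w)
    grad_v = [0.0] * len(v)
    for x, label in zip(data, labels):
        z = encode(w, x)
        residual = [vi * z - xi for vi, xi in zip(v, x)]   # reconstruction error
        dz = (2 * sum(r * vi for r, vi in zip(residual, v))
              + 2 * lam * (z - centroids[label]))           # d(loss)/dz
        for i in range(len(w)):
            grad_w[i] += dz * x[i] / len(data)
            grad_v[i] += 2 * residual[i] * z / len(data)
    w = [wi - lr * g for wi, g in zip(w, grad_w)]
    v = [vi - lr * g for vi, g in zip(v, grad_v)]
    return w, v


def alternating_clustering(data, k=2, steps=50):
    """Alternate clustering and representation updates (requires k >= 2)."""
    dim = len(data[0])
    w, v = [0.5] * dim, [0.5] * dim                     # fixed, assumed initialisation
    zs = [encode(w, x) for x in data]
    lo, hi = min(zs), max(zs)
    centroids = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    for _ in range(steps):
        zs = [encode(w, x) for x in data]
        labels = assign(zs, centroids)                       # clusters given representation
        centroids = update_centroids(zs, labels, centroids)
        w, v = representation_step(data, labels, centroids, w, v)  # representation given clusters
    return assign([encode(w, x) for x in data], centroids), w, v


# Hypothetical toy data, scaled to roughly unit range.
data = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (1.0, 1.0), (1.0, 1.1), (1.1, 1.0)]
labels, w, v = alternating_clustering(data)
```

In the setting described above, the linear maps would be replaced by the SAE and the loop would run over the joint production and complexity vectors of each university.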

In summary, the main contributions of this paper are as follows:

  • We propose a new Recurrent Deep Clustering Model (Recurrent-DC). Because deep representation learning and clustering alternate in a recurrent process, it achieves higher clustering accuracy and better stability than other state-of-the-art models.

  • We propose a new University Profiling Framework (UPF) for characterizing scientific research institutions when exploring academic graph datasets, which transforms the traditional depiction of excellence into the quantification of complexity indicators.

  • We select the data of top universities from ARWU and apply our framework and model to the Microsoft Academic Graph. We then find a positive relationship between university research production and university research complexity across different university groups.

The rest of the paper is organized as follows. Section 2 lays out the theoretical dimensions of the research. Section 3 formally formulates the problem and presents the overall architecture of the proposed solution. Section 4 describes the experimental setups and presents results to illustrate the effectiveness of Recurrent-DC. Section 5 analyzes the results and presents the findings, focusing on the application and visualization of the research. Finally, Section 6 concludes our work and discusses its limitations and promising future directions.

Section snippets

Related work

The related work is divided into two parts. The first part deals with institutional academic assessment when exploring academic graphs. The second part focuses on the progression from clustering to deep representation clustering.

University profiling framework

In this work, we focus on characterizing universities when exploring academic graphs. Unlike existing studies, we use new scientific indicators in our University Profiling Framework to characterize universities, rather than traditional excellence or volume indicators. The new scientific indicators comprise the Research Production Index (RPI), Productivity Value (PV), Research Complexity Index (RCI) and Opportunity Value (OV). The new Recurrent Deep Clustering Model was developed to get

Experiments

In this section, we present the details of how we carry out the experiments to verify the accuracy and effectiveness of our model. We describe how to obtain a suitable experimental dataset from the Microsoft Academic Graph. We compare the proposed Recurrent-DC model with various clustering methods including K-means, Affinity Propagation, Spectral Clustering, Density-Based Spatial Clustering of Applications with Noise, Stacked Autoencoder followed by K-means (SAE+KM) and Deep clustering

Application of university profiling framework

Taking advantage of the University Profiling Framework for the task of characterizing universities, we can apply UPF to the MAG dataset to obtain the university clusters automatically. We select top universities according to ARWU as described in the Experiments section.

Table 4 shows the Research Complexity Index, Opportunity Value, Research Production Index and Productivity Value of selected universities. The order of the list follows the MAG 2018 university ID [1]. The top universities

Conclusion

In this paper, we characterize top universities based on the quantification of complexity features when exploring the academic graph. We propose an efficient University Profiling Framework for characterizing scientific research institutions. Specifically, we design a new deep representation clustering model, Recurrent-DC, to jointly learn representation and clustering. Experiments on the top universities (according to ARWU) in the academic graph dataset demonstrate the effectiveness and efficiency of the

CRediT authorship contribution statement

Xiangjie Kong: Conceptualization, Supervision, Writing - original draft. Jiaxing Li: Methodology, Software, Writing - review & editing. Luna Wang: Investigation, Writing - review & editing. Guojiang Shen: Validation, Writing - review & editing. Yiming Sun: Software, Writing - review & editing. Ivan Lee: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was partially supported by the National Natural Science Foundation of China under Grant No. 62072409 and 62073295, and the Zhejiang Provincial Natural Science Foundation under Grant No. LR21F020003.


References (46)

  • García J.A. et al., Mapping academic institutions according to their journal publication profile: Spanish universities as a case study, J. Am. Soc. Inf. Sci. Technol. (2012)
  • Lee I. et al., Fitness and research complexity among research-active universities in the world, IEEE Trans. Emerg. Top. Comput. (2018)
  • Lee I. et al., An observation of research complexity in top universities based on research publications
  • Lloyd S., Least squares quantization in PCM, IEEE Trans. Inform. Theory (1982)
  • Yang B. et al., Towards k-means-friendly spaces: simultaneous deep learning and clustering
  • Vincent P. et al., Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. (2010)
  • Hershey J.R. et al., Deep clustering: Discriminative embeddings for segmentation and separation
  • J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: Proceedings of the 33rd...
  • Xia F. et al., Big scholarly data: A survey, IEEE Trans. Big Data (2017)
  • Williams K. et al., Scholarly big data information extraction and integration in the CiteSeerχ digital library
  • Priem J., Beyond the paper, Nature (2013)
  • Pradhan T. et al., A hybrid personalized scholarly venue recommender system integrating social network analysis and contextual similarity, Future Gener. Comput. Syst. (2019)
  • Wu Z. et al., Towards building a scholarly big data platform: Challenges, lessons and opportunities


    Xiangjie Kong received the B.Sc. and Ph.D. degrees from Zhejiang University, Hangzhou, China. He is currently a Full Professor with the College of Computer Science and Technology, Zhejiang University of Technology. Previously, he was an Associate Professor with the School of Software, Dalian University of Technology, China. He has published over 130 scientific papers in international journals and conferences (with over 100 indexed by ISI SCIE). His research interests include network science, mobile computing, and computational social science. He is a Senior Member of the IEEE and CCF and is a member of ACM.

    Jiaxing Li received the B.Sc. degree from Northwest A&F University, China, in 2018. He is currently working toward the master’s degree in the School of Software, Dalian University of Technology, China. His research interests include deep learning, social computing, and data science.

    Luna Wang received the M.Sc. degree from Zhejiang University, China. She is currently working at the Institute of Science and Technology, Dalian University of Technology. Her research interests include educational big data and knowledge management.

    Guojiang Shen received the B.Sc. degree in control theory and control engineering and the Ph.D. degree in control science and engineering from Zhejiang University, Hangzhou, China, in 1999 and 2004, respectively. He is currently a Professor with the College of Computer Science and Technology, Zhejiang University of Technology. His current research interests include artificial intelligence theory, big data analytics, and intelligent transportation systems.

    Yiming Sun is currently an undergraduate student in Software Engineering at Dalian University of Technology, China, working towards the B.Sc. degree. Her research interests include big scholarly data and network science.

    Ivan Lee received BEng, MCom, MER, and Ph.D. degrees from the University of Sydney, Australia. He was a software development engineer at Cisco Systems, a software engineer at Remotek Corporation, and an assistant professor at Ryerson University. Since 2008, he has been a senior lecturer at the University of South Australia. His research interests include smart sensors, multimedia systems, and data analytics.
