Abstract
We conduct an independent cluster validation study on published clustering solutions of a research testbed corpus, the Astro dataset of publication records from astronomy and astrophysics. We extend the dataset by collecting external validation data serving as proxies for the latent structure of the corpus. Specifically, we collect (1) grant funding information related to the publications, (2) data on topical special issues, (3) on specific journals’ internal topic classifications and (4) usage data from the main online bibliographic database of the discipline. The latter three types of data are newly introduced for the purpose of clustering validation and the rationale for using them for this task is set out. We find that one solution based on the global citation network achieves better results than the competitors across three validation data sources but that another solution based on bibliographic coupling performs best on the special issues data.
Similar content being viewed by others
Data availability
The original validation data is published at https://zenodo.org/record/4061694.
Notes
The reason for this result could be that these two solutions, eb and en, unlike the other ones, do not rely on direct citation, which is likely to be relatively rare between papers of a single special issue, but on bibliographic coupling and NLP-enhanced text similarity.
References
Ahlgren, P., Chen, Y., Colliander, C., & van Eck, N. J. (2020). Enhancing direct citations: A comparison of relatedness measures for community detection in a large set of PubMed publications. Quantitative Science Studies, 1(2), 714–729. https://doi.org/10.1162/qss_a_00027.
Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L., Chute, R., Rodriguez, M. A., et al. (2009). Clickstream data yields high-resolution maps of science. PLoS One, 4(3), e4803. https://doi.org/10.1371/journal.pone.0004803.
Boyack, K. W. (2017). Investigating the effect of global data on topic detection. Scientometrics, 111(2), 999–1015. https://doi.org/10.1007/s11192-017-2297-y.
Boyack, K. W., & Klavans, R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404. https://doi.org/10.1002/asi.21419.
Boyack, K.W., Newman, D., Duhon, R.J., Klavans, R., Patek, M., Biberstine, J. R., & Börner, K. (2011). Clustering more than two million biomedical publications: Comparing the accuracies of nine text-based similarity approaches. PloS ONE, 6(3). https://doi.org/10.1371/journal.pone.0018029
Glänzel, W., & Thijs, B. (2017). Using hybrid methods and ’core documents’ for the representation of clusters and topics: The astronomy dataset. Scientometrics, 111(2), 1071–1087. https://doi.org/10.1007/s11192-017-2301-6.
Gläser, J., Glänzel, W., & Scharnhorst, A. (2017). Same data-different results? Towards a comparative approach to the identification of thematic structures in science. Scientometrics, 111(2), 981–998. https://doi.org/10.1007/s11192-017-2296-z.
Halkidi, M., Vazirgiannis, M., & Hennig, C. (2015). Method-independent indices for cluster validation and estimating the number of clusters. In C. Hennig, M. Meila, F. Murtagh, & R. Rocci (Eds.), Handbook of cluster analysis (pp. 616–639). Chapman & Hall/CRC.
Havemann, F., Gläser, J., & Heinz, M. (2017). Memetic search for overlapping topics based on a local evaluation of link communities. Scientometrics, 111(2), 1089–1118. https://doi.org/10.1007/s11192-017-2302-5.
Klavans, R., & Boyack, K. W. (2017). Which type of citation analysis generates the most accurate taxonomy of scientific and technical knowledge? Journal of the Association for Information Science and Technology, 68(4), 984–998. https://doi.org/10.1002/asi.23734.
Koopman, R., Wang, S., & Scharnhorst, A. (2017). Contextualization of topics: Browsing through the universe of bibliographic information. Scientometrics, 111(2), 1119–1139. https://doi.org/10.1007/s11192-017-2303-4.
Kurtz, M. J., Eichhorn, G., Accomazzi, A., Grant, C., Demleitner, M., & Murray, S. S. (2005). Worldwide use and impact of the NASA Astrophysics Data System digital library. Journal of the American Society for Information Science and Technology, 56(1), 36–45. https://doi.org/10.1002/asi.20095.
Kurtz, M. J., Eichhorn, G., Accomazzi, A., Grant, C. S., & Murray, S. S. (2002). Second-order bibliometric operators in the Astrophysics Data System. Astronomical Data Analysis II, 4847, 238–245. https://doi.org/10.1117/12.460438.
Kurtz, M. J., & Henneken, E. A. (2014). Finding and recommending scholarly articles. In B. Cronin & C. R. Sugimoto (Eds.), Beyond Bibliometrics: Harnessing Multidimensional Indicators of Scholarly Impact (pp. 243–259). MIT Press.
Meila, M. (2015). Criteria for comparing clusterings. In C. Hennig, M. Meila, F. Murtagh, & R. Rocci (Eds.), Handbook of cluster analysis (pp. 640–657). Chapman & Hall/CRC.
Palchykov, V., Gemmetto, V., Boyarsky, A., & Garlaschelli, D. (2016). Ground truth? Concept-based communities versus the external classification of physics manuscripts. EPJ Data Science, 5(1), 28. https://doi.org/10.1140/epjds/s13688-016-0090-4.
Peel, L., Larremore, D. B., & Clauset, A. (2017). The ground truth about metadata and community detection in networks. Science Advances, 3(5), e1602548. https://doi.org/10.1126/sciadv.1602548.
Ruiz-Castillo, J., & Waltman, L. (2015). Field-normalized citation impact indicators using algorithmically constructed classification systems of science. Journal of Informetrics, 9(1), 102–117. https://doi.org/10.1016/j.joi.2014.11.010.
Shu, F., Julien, C.-A., Zhang, L., Qiu, J., Zhang, J., & Larivière, V. (2019). Comparing journal and paper level classifications of science. Journal of Informetrics, 13(1), 202–225. https://doi.org/10.1016/j.joi.2018.12.005.
Sjögårde, P., & Ahlgren, P. (2018). Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics. Journal of Informetrics, 12(1), 133–152. https://doi.org/10.1016/j.joi.2017.12.006.
Sjögårde, P., & Ahlgren, P. (2020). Granularity of algorithmically constructed publication-level classifications of research publications: Identification of specialties. Quantitative Science Studies, 1(1), 207–238. https://doi.org/10.1162/qss_a_00004.
Šubelj, L., van Eck, N. J., & Waltman, L. (2016). Clustering scientific publications based on citation relations: A systematic comparison of different methods. PLOS ONE, 11(4), e0154404. https://doi.org/10.1371/journal.pone.0154404.
van Eck, N. J., & Waltman, L. (2017). Citation-based clustering of publications using CitNetExplorer and VOSviewer. Scientometrics, 111(2), 1053–1070. https://doi.org/10.1007/s11192-017-2300-7.
Velden, T., Boyack, K. W., Gläser, J., Koopman, R., Scharnhorst, A., & Wang, S. (2017). Comparison of topic extraction approaches and their results. Scientometrics, 111(2), 1169–1221. https://doi.org/10.1007/s11192-017-2306-1.
Velden, T., Yan, S., & Lagoze, C. (2017). Mapping the cognitive structure of astrophysics by infomap clustering of the citation network and topic affinity analysis. Scientometrics, 111(2), 1033–1051. https://doi.org/10.1007/s11192-017-2299-9.
Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11, 2837–2854.
Waltman, L., Boyack, K. W., Colavizza, G., & van Eck, N. J. (2020). A principled methodology for comparing relatedness measures for clustering publications. Quantitative Science Studies, 1(2), 691–713. https://doi.org/10.1162/qss_a_00035.
Wang, S., & Koopman, R. (2017). Clustering articles based on semantic similarity. Scientometrics, 111(2), 1017–1031. https://doi.org/10.1007/s11192-017-2303-4.
Xu, S., Liu, J., Zhai, D., An, X., Wang, Z., & Pang, H. (2018). Overlapping thematic structures extraction with mixed-membership stochastic blockmodel. Scientometrics, 117(1), 61–84. https://doi.org/10.1007/s11192-018-2841-4.
Zhang, Y., Lu, J., Liu, F., Liu, Q., Porter, A., Chen, H., et al. (2018). Does deep learning help topic extraction? A kernel k-means clustering method with word embedding. Journal of Informetrics, 12(4), 1099–1117. https://doi.org/10.1016/j.joi.2018.09.004.
Acknowledgements
This research has made use of NASA’s Astrophysics Data System. We would like to thank Anastasiia Tcypina and Nikolai Schmarbeck for help with data collection. We further thank Clarivate Analytics for granting permission to use the Astro dataset which is derived from the Web of Science database. We also thank Michael J. Kurtz for his explanations of the ADS service.
Author information
Authors and Affiliations
Corresponding author
Appendix
Appendix
Rights and permissions
About this article
Cite this article
Donner, P. Validation of the Astro dataset clustering solutions with external data. Scientometrics 126, 1619–1645 (2021). https://doi.org/10.1007/s11192-020-03780-3
Received:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11192-020-03780-3