Validation of the Astro dataset clustering solutions with external data

Donner, Paul

doi:10.1007/s11192-020-03780-3

Published: 21 November 2020

Scientometrics, volume 126, pages 1619–1645 (2021)

Paul Donner ORCID: orcid.org/0000-0001-5737-8483¹

Abstract

We conduct an independent cluster validation study on published clustering solutions of a research testbed corpus, the Astro dataset of publication records from astronomy and astrophysics. We extend the dataset by collecting external validation data that serve as proxies for the latent structure of the corpus. Specifically, we collect (1) grant funding information related to the publications, (2) data on topical special issues, (3) data on specific journals’ internal topic classifications, and (4) usage data from the main online bibliographic database of the discipline. The latter three types of data are newly introduced for the purpose of clustering validation, and the rationale for using them for this task is set out. We find that one solution based on the global citation network achieves better results than its competitors across three validation data sources, but that another solution based on bibliographic coupling performs best on the special issues data.
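The general idea of validating a clustering against external groupings can be illustrated with a minimal sketch. It assumes a simple pair-counting score: for each external group (for example, the papers of one special issue), count how many pairs of its papers a solution places in the same cluster. The function name, the toy paper identifiers, and the score itself are illustrative assumptions and are not taken from the article, which may use different measures.

```python
from itertools import combinations


def same_cluster_pair_ratio(cluster_of, external_groups):
    """Share of within-group paper pairs that a solution puts in one cluster.

    cluster_of: dict mapping paper id -> cluster label (one solution).
    external_groups: dict mapping group id -> list of paper ids
        (e.g. one special issue, one grant, one journal section).
    Hypothetical helper for illustration only.
    """
    together = 0
    total = 0
    for papers in external_groups.values():
        # Ignore papers that the clustering solution does not cover.
        known = [p for p in papers if p in cluster_of]
        for a, b in combinations(known, 2):
            total += 1
            if cluster_of[a] == cluster_of[b]:
                together += 1
    return together / total if total else float("nan")


# Toy data (invented): two competing solutions and two special issues.
solution_a = {"p1": 0, "p2": 0, "p3": 1, "p4": 1, "p5": 2}
solution_b = {"p1": 0, "p2": 1, "p3": 1, "p4": 2, "p5": 2}
special_issues = {"issue_x": ["p1", "p2"], "issue_y": ["p3", "p4", "p5"]}

ratio_a = same_cluster_pair_ratio(solution_a, special_issues)
ratio_b = same_cluster_pair_ratio(solution_b, special_issues)
print(ratio_a, ratio_b)
```

On this toy input, solution_a keeps more within-issue pairs together than solution_b, so it would rank higher under this particular score. A score of this kind rewards coarse solutions with few large clusters, so comparing solutions of different granularity would need a chance-corrected or size-aware measure.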

Data availability

The original validation data is published at https://zenodo.org/record/4061694.

Notes

See http://www.topic-challenge.info.
http://141.20.126.171/solutions.html.
http://54xushuo.net/wiki/lib/exe/fetch.php?media=resources:datasets:xlza_2018.zip.
The reason for this result could be that these two solutions, eb and en, unlike the other ones, do not rely on direct citation, which is likely to be relatively rare between papers of a single special issue, but on bibliographic coupling and NLP-enhanced text similarity.

Acknowledgements

This research has made use of NASA’s Astrophysics Data System. We would like to thank Anastasiia Tcypina and Nikolai Schmarbeck for help with data collection. We further thank Clarivate Analytics for granting permission to use the Astro dataset which is derived from the Web of Science database. We also thank Michael J. Kurtz for his explanations of the ADS service.

Author information

Authors and Affiliations

Deutsches Zentrum für Hochschul- und Wissenschaftsforschung, Schützenstraße 6a, 10117, Berlin, Germany
Paul Donner

Corresponding author

Correspondence to Paul Donner.

Appendix

Table 9 Topical sections of four astronomy journals with occurrence counts

Table 10 Three best- and worst-performing clusters per solution (only values for clusters for which all four true positive ratios could be calculated)

About this article

Cite this article

Donner, P. Validation of the Astro dataset clustering solutions with external data. Scientometrics 126, 1619–1645 (2021). https://doi.org/10.1007/s11192-020-03780-3

Received: 14 July 2020
Published: 21 November 2020
Issue Date: February 2021
DOI: https://doi.org/10.1007/s11192-020-03780-3
