How many preprints have actually been printed and why: a case study of computer science preprints on arXiv

Scientometrics

Abstract

Preprints play an increasingly critical role in academic communities. Researchers post manuscripts to preprint servers before formal submission to journals or conferences for many reasons, but the use of preprints has also sparked considerable controversy, especially over claims of priority. This paper presents a case study of computer science preprints submitted to arXiv from 2008 to 2017 that quantifies how many preprints were eventually printed in peer-reviewed venues. Some of the published manuscripts appear under different titles, with no update to their preprints on arXiv. For these manuscripts, traditional fuzzy matching cannot map the preprint to the final published version. To address this, we introduce a semantics-based mapping method that employs Bidirectional Encoder Representations from Transformers (BERT). With this new mapping method and multiple data sources, we find that 66% of all sampled preprints are published under unchanged titles and 11% are published under different titles and with other modifications. A further analysis investigates why these preprints, but not others, were accepted for publication. Our comparison reveals that, in computer science, published preprints feature adequate revisions, multiple authorship, detailed abstracts and introductions, extensive and authoritative references, and available source code.
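To illustrate why title-based fuzzy matching breaks down when a manuscript is retitled, the following is a minimal sketch in Python. It is not the method used in the paper: the helper names (normalize, title_similarity, fuzzy_match), the 0.9 threshold, and the example titles are all assumptions made for illustration, and the similarity measure is the standard library's difflib.SequenceMatcher rather than any matcher the authors describe.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable, Optional


def normalize(title: str) -> str:
    """Lowercase the title and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def title_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_match(
    preprint_title: str,
    candidates: Iterable[str],
    threshold: float = 0.9,  # hypothetical cut-off, not taken from the paper
) -> Optional[str]:
    """Return the most similar candidate title, or None if none reaches the threshold."""
    best_title, best_score = None, 0.0
    for candidate in candidates:
        score = title_similarity(preprint_title, candidate)
        if score > best_score:
            best_title, best_score = candidate, score
    return best_title if best_score >= threshold else None


# Invented titles: the first pair differs only in case and punctuation,
# the second pair simulates a manuscript published under a new title.
same_title = fuzzy_match(
    "Deep Learning for Title Matching",
    ["Deep learning for title-matching.", "An Unrelated Paper"],
)
retitled = fuzzy_match(
    "A Study of Preprints",
    ["Why Manuscripts Get Published: An Analysis"],
)
```

In this sketch the first lookup succeeds because the titles are identical after normalization, while the retitled manuscript returns None even though it could be the same work. That gap is what motivates comparing meaning rather than surface strings.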

Notes

  1. https://arxiv.org/.

  2. https://arxiv.org/stats/monthly_submissions.

  3. https://www.biorxiv.org/.

  4. https://www.ssrn.com/.

  5. https://hcommons.org/.

  6. https://www.preprints.org/.

  7. MDPI is an acronym referring to two related organizations, Molecular Diversity Preservation International and Multidisciplinary Digital Publishing Institute.

  8. https://arxiv.org/help/stats.

  9. https://arxiv.org/help/api.

  10. https://arxiv.org/help/bulk_data.

  11. https://www.crossref.org/.

  12. https://github.com/CrossRef/rest-api-doc.

  13. https://dblp.org/.

  14. https://dblp.org/xml/.

  15. https://scholar.google.com/.

  16. https://paperswithcode.com/.

  17. https://arxiv.org/help/bib_feed.

  18. https://arxiv.org/help/jref.

  19. We originally added abstract pairs (preprint, candidate) to our dataset, but when the model was actually applied we found that only a small part of the Crossref data included abstracts, so the abstract pairs were removed from our model.

  20. arXiv:1901.07213.

  21. Preprints that fall under this condition have statements like “submitted to (a certain journal)” or “submitted to (a certain conference)” included in their metadata. However, as of the writing of this paper, no corresponding records can be found in any journal or conference proceedings. The results are further confirmed with the method presented in the "Methods" section. Therefore, we conclude that these submitted preprints were not accepted.

  22. Full names of categories on arXiv are attached in "Appendix 1" for reference.

  23. The data in the "What preprints can be printed" section were tested with the same methods, and we came to the same conclusion.

  24. https://github.com/allenai/science-parse.

  25. https://arxiv.org/help/replace.

  26. There were 53 preprints without tables among these 100 records, and we did find that a small part of the automatically parsed values were smaller than the manually calculated ones.

Acknowledgements

We are grateful to Yingmin Wang and Xingchen He for their assistance in processing data for this research. We thank Ziyi Chen for her help with the statistical analysis. We also thank two anonymous reviewers for their insightful comments. Special and heartfelt gratitude goes to the first author’s wife, Fenmei Zhou, for her understanding and love. Her unwavering support and continuous encouragement made this research possible.

Funding

This work is partly sponsored by the State Language Commission of China through the 13th Five-Year Plan project Artificial Intelligence and Language (Grant No. WT135-38).

Author information

Corresponding author

Correspondence to Xiaodong Shi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix 1: Abbreviation—full name of arXiv categories

Abbr.  Full name                                        Abbr.  Full name
CS.AI  Artificial Intelligence                          CS.IR  Information Retrieval
CS.AR  Hardware Architecture                            CS.IT  Information Theory
CS.CC  Computational Complexity                         CS.LG  Machine Learning
CS.CE  Computational Engineering, Finance, and Science  CS.LO  Logic in Computer Science
CS.CG  Computational Geometry                           CS.MA  Multiagent Systems
CS.CL  Computation and Language                         CS.MM  Multimedia
CS.CR  Cryptography and Security                        CS.MS  Mathematical Software
CS.CV  Computer Vision and Pattern Recognition          CS.NA  Numerical Analysis
CS.CY  Computers and Society                            CS.NE  Neural and Evolutionary Computation
CS.DB  Databases                                        CS.NI  Networking and Internet Architecture
CS.DC  Distributed, Parallel, and Cluster Computing     CS.OH  Other
CS.DL  Digital Libraries                                CS.OS  Operating Systems
CS.DM  Discrete Mathematics                             CS.PF  Performance
CS.DS  Data Structures and Algorithms                   CS.PL  Programming Languages
CS.ET  Emerging Technologies                            CS.RO  Robotics
CS.FL  Formal Languages and Automata Theory             CS.SC  Symbolic Computation
CS.GL  General Literature                               CS.SD  Sound
CS.GR  Graphics                                         CS.SE  Software Engineering
CS.GT  Computer Science and Game Theory                 CS.SI  Social and Information Networks
CS.HC  Human–Computer Interaction                       CS.SY  Systems and Control
EESS   Electrical Engineering and Systems Science

Cite this article

Lin, J., Yu, Y., Zhou, Y. et al. How many preprints have actually been printed and why: a case study of computer science preprints on arXiv. Scientometrics 124, 555–574 (2020). https://doi.org/10.1007/s11192-020-03430-8

  • DOI: https://doi.org/10.1007/s11192-020-03430-8
