Of Caterpillars and Butterflies: The Life and Afterlife of an ArXiv e-Print

ACM Web Science Conference 2012 Workshop
Evanston, IL, 21 June 2012

Vincent Larivière 1,2
Benoit Macaluso 2
Staša Milojević 3*
Cassidy R. Sugimoto 3
Mike Thelwall 4

1. École de bibliothéconomie et des sciences de l’information, Université de Montréal, C.P. 6128, Succ. Centre-Ville, Montréal, QC. H3C 3J7, Canada

2. Observatoire des sciences et des technologies (OST), Centre interuniversitaire de recherche sur la science et la technologie (CIRST), Université du Québec à Montréal, Canada

3. School of Information and Library Science, Indiana University Bloomington, 1320 E. 10th St. Bloomington, IN 47401, USA

4. School of Technology, University of Wolverhampton, Wulfruna Street, Wolverhampton, WV1 1LY, UK

*: Corresponding author


Since its creation by Paul Ginsparg in 1991, arXiv has become central to the diffusion of research in a number of fields, most notably physics, mathematics, and computer science. Previous researchers have studied aspects of this successful repository, focusing on use (Brown, 2001), ordering and citation rates (Haque & Ginsparg, 2009), coexistence of e-prints and journals (Henneken et al., 2007) and the effect of arXiv on citation rates (Moed, 2007). However, previous literature has not comprehensively integrated data from arXiv and Web of Science (WoS) in order to enhance our understanding of the new ecology of scholarly communication. This paper uses combined data to investigate publication delays, aging characteristics and scientific impact of arXiv e-prints and their published alter egos, both across the entire database and through a series of micro-analyses of the astrophysics domain.


This paper uses two data sources: the arXiv database and WoS. All arXiv database metadata from 1990 to March 22, 2012 were downloaded (N = 744,583 e-prints). All standard citation indexes were used for WoS (Science Citation Index Expanded, Social Science Citation Index and Arts and Humanities Citation Index) for the 1990-2011 period. Data are presented for 1995-2010 (although citations are compiled until the end of 2011). Two types of links between the data sources were created: (a) between the arXiv e-print and its published version indexed in WoS and (b) between the arXiv e-print and its citation in WoS. The first link (a) was created using a fuzzy match between the title of the e-print and the title of the WoS record, as well as the first author. Additional matching was performed using the journal field of arXiv and lowering the threshold of similarity between titles of e-prints and published papers. For the second link (b), we utilized the specific structure of references to arXiv e-prints in WoS. For example, a reference to an e-print from the condensed matter section of arXiv will have the string ‘CONDMAT’ followed by the series of seven or eight digits that correspond to its document ID in the online e-print database. Given that a paper belonging to more than one arXiv category can be cited using either category as a prefix, but retains a unique ID, the matching process only used the seven or eight digits. For the astrophysics domain, we separated documents into four categories: 1) arXiv e-prints never published in a journal, 2) arXiv e-prints published in a journal, 3) journal articles also published as an arXiv e-print and 4) journal articles that were never published as arXiv e-prints.
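The two linking steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 0.9 similarity threshold, the "Surname, Initials" author format and the use of the standard library's `difflib` as the fuzzy matcher are all assumptions made for illustration, as is the exact layout of the WoS cited-reference strings.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def normalize(title: str) -> str:
    """Lowercase a title and collapse punctuation and whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def titles_match(eprint_title: str, wos_title: str, threshold: float = 0.9) -> bool:
    """Link (a): fuzzy title comparison. The 0.9 threshold is an assumed
    value; lowering it corresponds to the looser second matching pass."""
    ratio = SequenceMatcher(None, normalize(eprint_title), normalize(wos_title)).ratio()
    return ratio >= threshold


def same_first_author(eprint_author: str, wos_author: str) -> bool:
    """Compare surnames only; assumes 'Surname, Initials' formatting."""
    def surname(name: str) -> str:
        return name.split(",")[0].strip().lower()
    return surname(eprint_author) == surname(wos_author)


# Link (b): a run of exactly seven or eight digits, whatever the category prefix.
ID_PATTERN = re.compile(r"(?<!\d)(\d{7,8})(?!\d)")


def extract_eprint_id(cited_reference: str) -> Optional[str]:
    """Return the seven- or eight-digit document ID, ignoring the prefix."""
    match = ID_PATTERN.search(cited_reference)
    return match.group(1) if match else None
```

Because only the digits are kept, hypothetical reference strings such as `CONDMAT 9901001` and `HEPTH 9901001` resolve to the same document, which is the behaviour the matching process relies on for cross-listed e-prints.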

Results and Discussion

Figure 1 shows that the delay between the submission of a manuscript to arXiv and its publication in a peer-reviewed journal has decreased over time. Whereas papers were once published a year after appearing on arXiv, publication in a journal is now likely to occur in the same year as the appearance on arXiv. There are two possible reasons for this: 1) a higher proportion of researchers are now waiting for the paper to be published or accepted for publication before submitting to arXiv or 2) the introduction of arXiv may have prompted publishers to decrease publication delays.

We also observe an initial increase in the proportion of arXiv submissions that are also published in a WoS-indexed peer-reviewed journal (Figure 1, inset). Overall, slightly less than 50% of arXiv submissions were also found in WoS. The small decrease visible for the last couple of years is likely due to documents that have a longer delay between arXiv submission and publication.

Figure 1. Distribution of the delay between arXiv submission and publication year, by year of submission to arXiv, 1995-2010. Inset: proportion of arXiv submissions that are also published in a WoS-indexed peer-reviewed journal.

Figure 2 presents (A) the trends in the numbers of papers that have appeared on arXiv only, on arXiv and WoS (arXiv version), only in WoS and on arXiv and WoS (WoS version) and (B) the mean number of citations these documents have received using a one-year citation window plus publication year. We see an increase in the number of documents published both in arXiv and in journals, an increase in the number of papers published only in arXiv, and a decline in papers published only in journals. The citation rates among the four groups vary over time. WoS versions of arXiv e-prints obtain the highest citation rates, a finding consistent with the documented association between open access and citation. However, this impact is decreasing and is approaching that of other WoS papers not submitted to arXiv, whose mean impact is increasing. There is no difference in the impact of the arXiv versions of published and unpublished papers. One could have expected that these unpublished papers, being non-refereed, would have a lower impact. However, it is possible that researchers prefer to cite the published version of an e-print, which is likely to reduce the impact of the e-print version of published papers and, hence, make the two measures comparable.

Figure 2. A) Number of documents published and B) mean number of citations received (publication year plus one year), for documents published on arXiv only, on arXiv and WoS (arXiv version), only in WoS and on arXiv and WoS (WoS version), 1995-2010
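The indicator in panel B can be sketched as follows. This is an illustrative reading of "publication year plus one year", not the authors' code: the function name, the tuple layout of the inputs and the group labels are assumptions.

```python
from collections import defaultdict


def mean_citations(documents, citations, window=1):
    """Mean citations per (group, publication year).

    documents: iterable of (doc_id, group, pub_year)
    citations: iterable of (doc_id, citing_year)
    Only citations made in pub_year .. pub_year + window are counted.
    """
    info = {doc_id: (group, year) for doc_id, group, year in documents}

    # Count in-window citations per document; unknown IDs are skipped.
    counts = defaultdict(int)
    for doc_id, citing_year in citations:
        if doc_id not in info:
            continue
        _, year = info[doc_id]
        if year <= citing_year <= year + window:
            counts[doc_id] += 1

    # Aggregate per group and year; uncited documents contribute zero.
    totals = defaultdict(lambda: [0, 0])  # (group, year) -> [citations, documents]
    for doc_id, (group, year) in info.items():
        totals[(group, year)][0] += counts[doc_id]
        totals[(group, year)][1] += 1

    return {key: cited / n_docs for key, (cited, n_docs) in totals.items()}
```

Uncited documents are kept in the denominator, so the result is a mean over all documents in a group rather than over cited documents only.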

Figure 3 presents the age distribution of citations received by the four groups of documents. It shows that e-prints and published papers follow different patterns. Citations to e-prints peak in the year following their submission, while citations to papers are similar during the two years following publication. The decline is much faster for e-prints, with a small proportion of citations (less than 5%) received past the fifth year following publication. Given the transfer of citations from to-be-published e-prints to their published versions, citations to the e-print versions of published papers decay faster than those received by unpublished e-prints.

Figure 3. Percentage of citations received, for documents published on arXiv only, on arXiv and WoS (arXiv version), only in WoS and on arXiv and WoS (WoS version), 1995-2010
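An age distribution of this kind can be computed as in the sketch below, which expresses each citation's age as the citing year minus the publication (or submission) year and reports the share of citations at each age. The function name and input layout are assumptions, and citations dated before the document's year are dropped here for simplicity.

```python
from collections import Counter


def citation_age_distribution(pub_year_by_doc, citations):
    """Percentage of citations received at each age (in years).

    pub_year_by_doc: dict mapping doc_id -> publication or submission year
    citations: iterable of (doc_id, citing_year)
    """
    ages = Counter()
    for doc_id, citing_year in citations:
        pub_year = pub_year_by_doc.get(doc_id)
        # Skip unknown documents and citations that predate the document.
        if pub_year is None or citing_year < pub_year:
            continue
        ages[citing_year - pub_year] += 1

    total = sum(ages.values())
    if total == 0:
        return {}
    return {age: 100.0 * n / total for age, n in sorted(ages.items())}
```

Running the function separately on each of the four document groups would give one curve per group.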

Conclusion and Future Research

This paper demonstrates that the average delay between submission to arXiv and publication in a WoS-indexed peer-reviewed journal has decreased. For astrophysics, the number of papers appearing on arXiv is increasing. This can be an indicator of the changing role arXiv plays within this community: it has moved from being a space where a minority shared pre-prints to a place where the majority of the research produced is archived. This interpretation is further supported by the very similar scientific impact of papers appearing in arXiv and in journals. Future work will explore the potential of these combined datasets for understanding the thematic and temporal relationships between these genres and how individuals interact in this space.

Cited References

Brown, C. (2001). The E-volution of Preprints in the Scholarly Communication of Physicists and Astronomers. Journal of the American Society for Information Science and Technology, 52(3):187–200

Haque, A.U., & Ginsparg, P. (2009). Positional Effects on Citation and Readership in arXiv. Journal of the American Society for Information Science and Technology, 60(11):2203–2218

Henneken, E.A., Kurtz, M.J., Eichhorn, G., Accomazzi, A., Grant, C.S., Thompson, D., Bohlen, E., & Murray, S.S. (2007). E-prints and journal articles in astronomy: a productive co-existence. Learned Publishing, 20(1):16–22

Moed, H.F. (2007). The Effect of “Open Access” on Citation Impact: An Analysis of ArXiv’s Condensed Matter Section. Journal of the American Society for Information Science and Technology, 58(13):2047–2054

Creative Commons License

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
