How do you cite data and code?

Citing code and data provides a solid base for the research. We explain what you should consider.

The data and code we use, develop and process are crucial elements of scientific practice. However, there are still open questions about how and why to cite them. Here are some pointers.

Why cite data and code?

In general, citing data helps us to:

  • Prevent scientific misconduct, for instance results based on fabricated or “massaged” data.
  • Give credit to other data producers. Citing data also signals that we have reused third-party data, that is, that we have built upon others’ research, which accelerates the research process.
  • Show the solid basis of our own research. The data serve as the foundation of our results and can be verified by others at any time.
  • Enable reproducibility and reuse, which are at the core of the scientific method. Apart from being citable (and cited), reusable data should come with detailed documentation and be stored in a durable, interoperable format in a reliable data repository.

 

Make your data and code citable

To cite data and code, we first need to identify them unambiguously, that is, assign a persistent identifier (PID) to the data or code. There are many types of PIDs, but the most widely accepted for scientific results is probably the DOI (Digital Object Identifier).
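DOIs are written in several ways in practice (as a bare name, with a "doi:" prefix, or as an http or https link), so it can help to bring them into one resolvable form before citing. The following is a minimal sketch, not an official validator: the function names are our own, and the pattern is a heuristic that assumes the common modern shape of a DOI name ("10.", a registrant code, a slash and a suffix). It will reject shortDOI aliases and may reject some older DOIs.

```python
import re

# Heuristic pattern for modern DOI names: "10.", a 4-9 digit registrant code,
# a slash, then a non-empty suffix. This is an assumption for illustration and
# does not cover every DOI ever registered.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that are commonly written in front of a DOI name.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return the bare DOI name, e.g. '10.5281/zenodo.3549398'."""
    candidate = raw.strip()
    lowered = candidate.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            candidate = candidate[len(prefix):]
            break
    candidate = candidate.strip()
    if not DOI_PATTERN.match(candidate):
        raise ValueError(f"not a recognisable DOI: {raw!r}")
    return candidate


def doi_to_url(raw: str) -> str:
    """Return the resolvable HTTPS form of a DOI."""
    return "https://doi.org/" + normalize_doi(raw)
```

For example, doi_to_url("doi:10.1594/PANGAEA.905595") and doi_to_url("http://doi.org/10.1594/PANGAEA.905595") both yield the same https link, which is the form we would put into a citation.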

Many data repositories mint DOIs for uploaded datasets. Examples are the generalist repository Zenodo, the Austrian social science data repository AUSSDA and the earth and environmental sciences data repository PANGAEA. The data repository registry re3data also offers search filters for finding repositories that use DOIs to identify datasets.

For code stored in a GitHub repository, we can obtain a DOI by connecting the repository with Zenodo. One advantage of the GitHub/Zenodo integration is that each version or release of a project repository is assigned a new DOI. This facilitates version control and ensures that each version can be properly identified.
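When Zenodo archives a GitHub release, the metadata of the resulting record can, as far as we are aware, be controlled through an optional .zenodo.json file in the root of the repository. The sketch below writes such a file with Python. All values are placeholders for a hypothetical project, and the keys are assumed to follow Zenodo's deposit metadata fields; please check the current Zenodo documentation before relying on them.

```python
import json

# Placeholder metadata for a hypothetical project. The keys are assumed to
# match Zenodo's deposit metadata fields; the values are invented examples.
zenodo_metadata = {
    "title": "example-analysis: scripts for a hypothetical study",
    "upload_type": "software",
    "description": "Code used to produce the figures of a hypothetical paper.",
    "creators": [
        {"name": "Doe, Jane", "affiliation": "Example University"},
    ],
    "keywords": ["research software", "reproducibility"],
}


def render_zenodo_json(metadata: dict) -> str:
    """Serialise the metadata as pretty-printed JSON text."""
    return json.dumps(metadata, indent=2, ensure_ascii=False) + "\n"


def write_zenodo_json(metadata: dict, path: str = ".zenodo.json") -> None:
    """Write the metadata file; the default path assumes the repository root."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_zenodo_json(metadata))
```

Calling write_zenodo_json(zenodo_metadata) from the repository root and committing the file before creating a release should then let the archived version carry the author names and title we want to be cited with.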

See Module 5 of the Open Science MOOC (Open Research Software and Open Source) for more information about using GitHub and Zenodo.

Another way to make code citable, and to offer complementary documentation, is to publish it in a “code journal”. By “code journal”, we mean a scholarly journal (with an ISSN) that ensures the quality of the submitted software through peer review. One example is the Journal of Open Source Software.

 

Data citation principles

In 2014, the FORCE11 Data Citation Synthesis Group published a Joint Declaration of Data Citation Principles. These are:

  1. Importance
  2. Credit and Attribution
  3. Evidence
  4. Unique Identification
  5. Access
  6. Persistence
  7. Specificity and Verifiability
  8. Interoperability and Flexibility

These principles should be reflected in how we cite a dataset in a scientific text. A data citation should include the following elements:

Author(s), Year, Dataset title, Data Repository or Archive, Version, Global Persistent Identifier.

  • Naming the authors reflects principle 2, since attribution is given to all contributors to the data. Naming the data repository or archive does the same.
  • Adding the version reflects principle 7 and facilitates the identification of specific data (including the specific time interval or granular portion of the data retrieved). This is particularly important when dealing with dynamic data.
  • Using a global persistent identifier (e.g. a DOI) reflects principles 4, 5 and 6 and provides access not only to the dataset but also to the corresponding metadata.
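The point about metadata can be made concrete: the DOI resolver supports content negotiation, so the same DOI link that leads a human reader to a landing page can return machine-readable metadata when a suitable Accept header is sent. The sketch below prepares such a request with the Python standard library. The function names are our own, and we assume the registration agency behind the DOI (for example DataCite or Crossref) offers the CSL JSON media type used here.

```python
import json
import urllib.request

# Media type for CSL JSON metadata, assumed to be offered for the DOI at hand.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_metadata_request(doi: str, media_type: str = CSL_JSON) -> urllib.request.Request:
    """Prepare (but do not send) a metadata request for a bare DOI name."""
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": media_type},
    )


def fetch_metadata(doi: str) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    request = build_metadata_request(doi)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With network access, fetch_metadata("10.1594/PANGAEA.905595") should return a dictionary describing the dataset, from which the citation elements listed above can be read instead of being typed by hand.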

Example of a data citation:

Jang, Kyoung-Soon; Park, Ki-Tae (2019): Chemical characteristics of the assigned elemental formulas from the FT-ICR MS data of Arctic aerosol-derived organic matter collected in Ny-Ålesund in May 2015. PANGAEA, version 1.0, https://doi.org/10.1594/PANGAEA.905595
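A citation like this one can be assembled from the six elements named above. The following is a small sketch under our own assumptions: the class and function names are hypothetical, and the punctuation simply imitates the example rather than any particular citation style.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """The six citation elements: authors, year, title, repository, version, PID."""
    authors: List[str]   # each written as "Family name, Given name"
    year: int
    title: str
    repository: str
    version: str
    pid: str             # full persistent identifier, e.g. a DOI link


def format_data_citation(record: DatasetRecord) -> str:
    """Join the elements in the order used in the example citation."""
    if not record.authors:
        raise ValueError("at least one author is required")
    authors = "; ".join(record.authors)
    title = record.title.rstrip(".")  # avoid a doubled full stop after the title
    return (
        f"{authors} ({record.year}): {title}. "
        f"{record.repository}, version {record.version}, {record.pid}"
    )


example = DatasetRecord(
    authors=["Jang, Kyoung-Soon", "Park, Ki-Tae"],
    year=2019,
    title=(
        "Chemical characteristics of the assigned elemental formulas from the "
        "FT-ICR MS data of Arctic aerosol-derived organic matter collected in "
        "Ny-Ålesund in May 2015"
    ),
    repository="PANGAEA",
    version="1.0",
    pid="https://doi.org/10.1594/PANGAEA.905595",
)
print(format_data_citation(example))
```

Keeping the elements as separate fields makes it harder to forget the version or the identifier, which are the two parts most often missing from data citations.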

 

Code citation principles

As with data citation, FORCE11 has also proposed principles for software citation:

  1. Importance
  2. Credit and Attribution
  3. Unique Identification
  4. Persistence
  5. Accessibility
  6. Specificity

Example of a software citation:

Zhao, Junbin. (2019, November 21). FluxCalR: a R package for calculating CO2 and CH4 fluxes from static chambers (Version 0.2.0). Zenodo. http://doi.org/10.5281/zenodo.3549398

This citation refers to the following GitHub repository: https://github.com/junbinzhao/FluxCalR/

Further reading

  1. Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014 https://doi.org/10.25490/a97f-egyk
  2. Smith, Arfon M., Daniel S. Katz, and Kyle E. Niemeyer. ‘Software Citation Principles’. PeerJ Computer Science 2 (19 September 2016): e86. https://doi.org/10/bw3g.
  3. Starr J, Castro E, Crosas M, Dumontier M, Downs RR, Duerr R, Haak LL, Haendel M, Herman I, Hodson S, Hourclé J, Kratz JE, Lin J, Nielsen LH, Nurnberger A, Proell S, Rauber A, Sacchi S, Smith A, Taylor M, Clark T. 2015. Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Computer Science 1:e1 https://doi.org/10.7717/peerj-cs.1
  4. Rauber, Andreas; Ari Asmi; Dieter van Uytvanck; Stefan Proell (2015): Data Citation of Evolving Data: Recommendations of the Working Group on Data Citation (WGDC). https://doi.org/10.15497/RDA00016    

Contact

Paloma Marín Arraiza

TU Wien Bibliothek

Twitter: @RDMTUWien