Data Citation – a snapshot of the chemical landscape.

The recent release of the DataCite Data Citation Corpus, which has the stated aim of providing “a trusted central aggregate of all data citations to further our understanding of data usage and advance meaningful data metrics”, made me want to investigate the current state of data citation in chemistry. Chemistry is known to be a “data rich” science (as most of the physical sciences are), and here on this very blog I try, whenever possible, to cite the source(s) of the data that I use when discussing a topic. Such citations are not necessarily the same as citing a journal article via e.g. its DOI, although of course one is very likely to find data associated with most articles nowadays, albeit almost entirely via an associated supporting information document. However, the latter is often presented in a relatively unstructured (PDF) form, which does not adhere to what are called the “FAIR” guidelines of being findable, accessible, interoperable and reusable. Directly citing data is a way of improving its FAIR characteristics. So what insights does the Data Citation Corpus reveal?

  1. This overview shows that by far the most common mechanism for citing data is via its accession number, used predominantly in the life sciences (an example of the latter is linked here[1]), with the DOI (digital object identifier) being less common.
  2. Tunnelling down to citation counts in chemical sciences by publisher, an odd picture emerges with just a handful of citations.
  3. The more general physical sciences do not fare much better.
  4. Let's try a different approach, filtering by repository. Here are the statistics for the Cambridge Crystallographic Data Centre, whose data was being cited in large amounts a few years back, but where citations appear to have dropped off in the last few years. Given that the number of entries there continues to rise almost exponentially, we begin to suspect that the data citations there are not being properly recognised as such by the citation corpus.
  5. Let's try another repository, Zenodo, where citations are again dropping, with totals of about 500 a year for the most recent years.
  6. OK, one more go, the RSC chemistry publisher.
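
For anyone wanting to repeat this kind of filtering offline, the two questions asked above (DOI or accession number? how many citations per year for one repository?) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the record fields (`repository`, `identifier`, `year`), the repository names and the sample values are hypothetical placeholders, not the actual schema or contents of the Data Citation Corpus. The only grounded detail is that a DOI starts with the prefix `10.` followed by a registrant code and a slash.

```python
import re
from collections import Counter

# Hypothetical placeholder records. The field names and values are
# assumptions for illustration only; they are NOT taken from the
# Data Citation Corpus and do not reflect its real schema.
SAMPLE = [
    {"repository": "repo-a", "identifier": "10.1234/example-a", "year": 2022},
    {"repository": "repo-a", "identifier": "10.1234/example-b", "year": 2022},
    {"repository": "repo-a", "identifier": "10.1234/example-c", "year": 2023},
    {"repository": "repo-b", "identifier": "ABC123456", "year": 2023},
]

# A DOI begins with "10.", a numeric registrant code, then "/" and a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def identifier_kind(identifier):
    """Label an identifier as 'doi', or as 'accession' for anything else."""
    return "doi" if DOI_PATTERN.match(identifier) else "accession"


def citations_per_year(records, repository):
    """Count the records for one repository, grouped by citation year."""
    return Counter(r["year"] for r in records if r["repository"] == repository)


if __name__ == "__main__":
    print(Counter(identifier_kind(r["identifier"]) for r in SAMPLE))
    print(citations_per_year(SAMPLE, "repo-a"))
```

Treating every non-DOI identifier as an accession number is a deliberate simplification; a real analysis would need to recognise each repository's own accession format.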

I am not sure what to make of this. The very high levels of data citation that you would expect in areas of the chemical sciences do not appear in the corpus; I think that, for some reason, the DataCite citation corpus is not yet capturing them.[2] But when things do start operating as perhaps expected, I think we will have a very valuable resource, which should firmly put data (whether FAIR or not) on the map.


  1. D. Batista, A. Gonzalez-Beltran, S. Sansone, and P. Rocca-Serra, "Machine actionable metadata models", Scientific Data, vol. 9, 2022.
  2. R. Page, "Problems with the DataCite Data Citation Corpus", 2024.
