A search of some major chemistry publishers for FAIR data records.

In recent years, findable data has become ever more important (the F in FAIR). Here I test that F using the DataCite search service.

First, an introduction to this service. It is a metadata database describing datasets and other research objects. One of its properties is relatedIdentifier, which records other identifiers associated with the dataset, such as the DOI of any published article associated with the data, although it could also point to related datasets.

One can query thus:

  1. https://search.datacite.org/works?query=relatedIdentifiers.relatedIdentifier:*
    which retrieves the very healthy looking 6,179,287 works.
  2. One can restrict this to a specific publisher by the DOI prefix assigned to that publisher:
    ?query=relatedIdentifiers.relatedIdentifier:10.1021*
    which returns a respectable 210,240 works.
  3. It turns out that the major contributors to FAIR currently are crystal structures from the CCDC. One can remove them from the search to see what is left over:
    ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*) 
    and one is down to 14,213 works, of which many nevertheless still appear to be crystal structures. These may be links to other crystal datasets.
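The three queries above differ only in the DOI prefix and the exclusion clause, so they can be generated rather than typed. What follows is a minimal sketch, assuming the base URL and query syntax exactly as quoted above; the function names are hypothetical, and the code only builds the URL strings without contacting the service.

```python
# Base URL as quoted in search 1; whether the service still answers at this
# address is not checked here.
BASE = "https://search.datacite.org/works"


def search2_query(prefix: str) -> str:
    """Query for records whose related identifier starts with a publisher prefix."""
    return f"relatedIdentifiers.relatedIdentifier:{prefix}*"


def search3_query(prefix: str, exclude_prefix: str = "10.5517") -> str:
    """As search 2, but dropping records whose own identifier contains exclude_prefix."""
    return f"({search2_query(prefix)})+NOT+(identifier:*{exclude_prefix}*)"


def search_url(query: str) -> str:
    """Attach a query string to the base URL (hypothetical helper)."""
    return f"{BASE}?query={query}"


if __name__ == "__main__":
    for prefix in ("10.1021", "10.1039"):
        print(search_url(search2_query(prefix)))
        print(search_url(search3_query(prefix)))
```

Keeping the excluded prefix as a parameter makes it easy to subtract a different data repository from the results.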

I have performed searches 2 and 3 for some popular publishers of chemistry (the same set that were analysed here).

Publisher   Search 2   Search 3
ACS          210,240     14,213
RSC          138,147      1,279
Elsevier     185,351     56,373
Nature        12,316      8,104
Wiley        135,874      9,283
Science        3,384      2,343
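To compare publishers of very different size, it can help to express search 3 as a fraction of search 2. This is a small sketch using only the counts in the table; the variable and function names are mine, and the fraction says only how many records survive the 10.5517 exclusion, not how many are truly non-crystallographic.

```python
# (search 2, search 3) counts copied from the table above.
counts = {
    "ACS": (210_240, 14_213),
    "RSC": (138_147, 1_279),
    "Elsevier": (185_351, 56_373),
    "Nature": (12_316, 8_104),
    "Wiley": (135_874, 9_283),
    "Science": (3_384, 2_343),
}


def remaining_share(search2: int, search3: int) -> float:
    """Fraction of search 2 records left after excluding 10.5517 identifiers."""
    return search3 / search2


shares = {name: remaining_share(s2, s3) for name, (s2, s3) in counts.items()}

if __name__ == "__main__":
    # Print publishers from the highest remaining share to the lowest.
    for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
        print(f"{name:<10}{share:6.1%}")
```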

These publishers all have significant numbers of datasets which at least accord with the F of FAIR. Many datasets may lack metadata pointing back to a published article, since that link can often be added only once the DOI of the article exists, in other words AFTER the publication of the dataset. So these numbers are probably underestimates rather than overestimates.

How about the other way around? Rather than datasets that have a journal article as a related identifier, we could search for articles that have a dataset as a related identifier.

  1. ?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)
    returns, rather mysteriously, nothing found. It might be that there is no mapping of this search between the CrossRef and DataCite metadata schemas.
  2. And just to show the searches are behaving as expected:
    ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)
    returns 196,027 works.
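The last query is the complement of search 3 for the ACS prefix, so the two counts should add up to search 2. Here is a quick check of that arithmetic, using only numbers quoted in this post (the variable names are mine):

```python
# Counts for the ACS prefix 10.1021, as quoted in this post.
search2_total = 210_240      # related identifier starts with 10.1021
without_ccdc = 14_213        # ... NOT identifier containing 10.5517 (search 3)
with_ccdc = 196_027          # ... AND identifier containing 10.5517

# The AND and NOT variants split search 2 into two disjoint sets,
# so their counts should sum to the search 2 total.
difference = search2_total - (without_ccdc + with_ccdc)

if __name__ == "__main__":
    print("sum of parts:", without_ccdc + with_ccdc)
    print("difference from search 2:", difference)
```

The difference is zero, which suggests the AND and NOT operators are being applied as intended.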

It will also be of interest to show how these numbers change over time. Is there an exponential increase? We shall see.

Finally, we have not really explored adherence to, for example, the AIR of FAIR. That is for another post.

Henry Rzepa

Henry Rzepa is Emeritus Professor of Computational Chemistry at Imperial College London.
