Raw data and the evolution of crystallographic FAIR data. Journals, processed and raw structure data.

In my previous post on the topic, I introduced the concept that data can come in several forms, most commonly as “raw” or primary data and as a “processed” version of this data that has added value. In crystallography, the chemist is interested in this processed version, carried by a CIF file. However, on the rare occasions when a query arises about the processed component, it can in principle at least be resolved by taking a look at the original raw data, expressed as diffraction images. I established, with much appreciated help from the CCDC, that since 2016 around 65 datasets in the CSD (Cambridge Structural Database) have appeared with such associated raw data. The problem is how to reconcile the two sets of data easily (the raw data is not stored in the CSD), and one way of doing this is via the metadata associated with the datasets. In turn, if this metadata is suitably registered, one can query the metadata store for such associations, as was illustrated in the previous post on the topic. Here I explore the metadata records for five of these 65 sets, selected to illustrate the five data repositories that thus far host such data for compounds in the CSD.
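As a rough illustration of what querying a metadata store can look like, here is a minimal sketch in Python. It assumes the DOI is registered with DataCite and uses the public DataCite REST endpoint `https://api.datacite.org/dois/`; the function names are my own, and the record layout (`data` → `attributes` → `relatedIdentifiers`) is how I understand the DataCite JSON response, so treat it as an assumption to check against a real record.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the DOI is registered with DataCite and served by its public REST API.
DATACITE_API = "https://api.datacite.org/dois/"


def datacite_url(doi: str) -> str:
    """Build the REST URL for one DOI record (slashes in the DOI are kept)."""
    return DATACITE_API + urllib.parse.quote(doi)


def related_identifiers(record: dict) -> list[tuple[str, str]]:
    """Return (relationType, relatedIdentifier) pairs from a DataCite-style record."""
    attrs = record.get("data", {}).get("attributes", {})
    return [
        (rel.get("relationType", ""), rel.get("relatedIdentifier", ""))
        for rel in attrs.get("relatedIdentifiers", [])
    ]


def fetch_record(doi: str) -> dict:
    """Download the metadata record for a DOI (needs network access)."""
    with urllib.request.urlopen(datacite_url(doi)) as response:
        return json.load(response)
```

Calling `related_identifiers(fetch_record("10.14469/hpc/2298"))` would then list whatever associations that record declares; what comes back depends entirely on what the repository registered.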

| Raw data repository | Raw data DOI | Raw data → CSD? | CSD → Raw data? | ⇐ Journal ⇒ |
|---|---|---|---|---|
| Zenodo | 10.5281/zenodo.4271549 | No | No | 10.1039/C6RA28567H |
| Imperial College research data repository | 10.14469/hpc/2298 | Yes | Yes | 10.1021/acsomega.7b00482 |
| RepoD, a Harvard Dataverse instance | 10.18150/repod.6628285 | No | No | 10.1021/acs.cgd.0c01252 |
| Cambridge university repository | 10.17863/CAM.21968 | No | No | 10.1016/j.inoche.2018.08.024 |
| ISIS Neutron and Muon Source data journal | 10.5286/ISIS.E.RB1620465 | No | No | 10.1039/D0CC02418J |

Ideally, one is looking for bidirectional links between the raw and the processed data, expressed in the metadata of both. As you can see from the above, these links are present in only one of the five sets. More commonly, both the raw and the processed data will contain links to the journal article where the data is discussed. Links from the journal article to the raw data are very much less common, although such links are slightly more likely to exist from the journal to the processed data. If you click on the link in any of the last three columns, a copy of the metadata will download for you to inspect. There you can verify whether the assertions made above are correct.
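The check for bidirectional links can be sketched in a few lines of Python. This assumes each metadata record has been reduced to a dictionary with a `doi` key and a `relatedIdentifiers` list in the style of the DataCite schema; the two records and their DOIs below are hypothetical placeholders, not real datasets.

```python
def links_to(record: dict, target_doi: str) -> bool:
    """True if the record lists target_doi among its related identifiers."""
    target = target_doi.lower()  # DOIs are compared case-insensitively
    return any(
        rel.get("relatedIdentifier", "").lower() == target
        for rel in record.get("relatedIdentifiers", [])
    )


def link_status(raw: dict, processed: dict) -> tuple[bool, bool]:
    """Return (raw -> processed, processed -> raw) for one pair of records."""
    return links_to(raw, processed["doi"]), links_to(processed, raw["doi"])


# Hypothetical records: the DOIs are placeholders, not real datasets.
raw = {
    "doi": "10.9999/raw-example",
    "relatedIdentifiers": [
        {"relatedIdentifier": "10.9999/cif-example", "relationType": "IsSourceOf"}
    ],
}
processed = {"doi": "10.9999/cif-example", "relatedIdentifiers": []}

print(link_status(raw, processed))  # (True, False)
```

Only a result of `(True, True)` would count as the ideal bidirectional case; in this made-up pair the raw data points at the processed data but not the other way round.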

What the metadata records above demonstrate is a very small-scale, so-called PID graph (DOI: 10.5438/jwvf-8a66 [1]), in which each DOI is a node and any connection that exists is shown by a line connecting two nodes. The PID graph can be extended to include a third type of node, the journal article, and then it starts to get interesting! I will investigate whether I can generate the PID graph for the above, although be prepared: it will not (yet) contain very many lines between nodes!
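To give a feel for how sparse such a graph is, here is a minimal sketch that builds nodes and directed edges from the five table rows. Two things are assumptions on my part: the CSD DOIs are not listed in the table, so each processed dataset is represented by a placeholder label, and edges to or from the journal article are left out because the table does not record them row by row.

```python
# Rows from the table: (raw data DOI, raw -> CSD link?, CSD -> raw link?, journal DOI)
ROWS = [
    ("10.5281/zenodo.4271549", False, False, "10.1039/C6RA28567H"),
    ("10.14469/hpc/2298", True, True, "10.1021/acsomega.7b00482"),
    ("10.18150/repod.6628285", False, False, "10.1021/acs.cgd.0c01252"),
    ("10.17863/CAM.21968", False, False, "10.1016/j.inoche.2018.08.024"),
    ("10.5286/ISIS.E.RB1620465", False, False, "10.1039/D0CC02418J"),
]


def build_pid_graph(rows):
    """Return (nodes, edges); edges are directed (source, target) pairs."""
    nodes, edges = set(), set()
    for raw_doi, raw_to_csd, csd_to_raw, journal_doi in rows:
        # Placeholder label: the real CSD DOI is not given in the table.
        csd_node = f"CSD entry for {journal_doi}"
        nodes.update({raw_doi, csd_node, journal_doi})
        if raw_to_csd:
            edges.add((raw_doi, csd_node))
        if csd_to_raw:
            edges.add((csd_node, raw_doi))
    return nodes, edges


nodes, edges = build_pid_graph(ROWS)
print(len(nodes), len(edges))  # 15 2
```

Fifteen nodes and only two edges between raw and processed data: the journal nodes are present but unconnected here, which is exactly the part that would need the full metadata records to fill in.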

References

    Henry Rzepa

    Henry Rzepa is Emeritus Professor of Computational Chemistry at Imperial College London.
