Supporting information: chemical graveyard or invaluable resource for chemical structures.

Nowadays, the data supporting most publications on the synthesis of organic compounds are more likely than not to be found in the associated “supporting information” (SI) rather than in the (often page-limited) article itself. For example, this article[cite]10.1021/jacs.6b13229[/cite] has an SI that runs to 907 pages; almost a mini-database in its own right!^† Here I ponder whether such dissemination of data is FAIR (Findable, Accessible, Interoperable and Re-usable).[cite]10.1038/sdata.2016.18[/cite]

I am going to use this article as my starting point.[cite]10.1038/nchembio.1340[/cite] One of the compounds synthesized is shown below; it is not explicitly discussed in the main body of the article. So how findable is it?

A search of SciFinder (Chemical Abstracts) using the structure above reveals one hit, the source being the expected one.[cite]10.1038/nchembio.1340[/cite]
A search of Reaxys (formerly Beilstein) reveals no hits in its own database, but one hit is noted in …
PubChem, where it occurs as substance 163835830. The source is again cited correctly.[cite]10.1038/nchembio.1340[/cite] One of the properties reported is the InChI key: JSLVVAICXSKSEQ-UHFFFAOYSA-N. This is the same key as that generated by the structure drawing programs ChemDraw or ChemDoodle.
Google, on the other hand, finds nothing for JSLVVAICXSKSEQ-UHFFFAOYSA-N.[cite]10.1039/B502828K[/cite]
I also tried Google Scholar but again with no luck.

So supporting information does appear to be indexed by both Chemical Abstracts and PubChem; it is thankfully not a graveyard![cite]10.1186/s13321-016-0175-x[/cite] The chemical databases do return valuable additional information about the molecule, such as its InChI key and much else besides. Given that the open PubChem resource presumably IS indexed by Google, there must be a policy somewhere that prevents e.g. JSLVVAICXSKSEQ-UHFFFAOYSA-N from being found.
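
Since the InChI key is what made the molecule traceable here, a minimal sketch of handling such a key programmatically may help. It checks that a string has the 14-10-1 letter layout of a standard InChI key and builds the address of a PubChem PUG REST lookup by InChI key. The helper names `is_inchikey` and `pubchem_cid_url` are hypothetical, no network request is made, and whether this particular key resolves to a compound record is an assumption (the search above found it as a substance record).

```python
import re
from urllib.parse import quote

# A standard InChI key is 27 characters: 14 letters, a hyphen,
# 10 letters, a hyphen, and one final letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

# Base address of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def is_inchikey(text: str) -> bool:
    """Return True if text has the layout of a standard InChI key."""
    return INCHIKEY_PATTERN.fullmatch(text) is not None


def pubchem_cid_url(inchikey: str) -> str:
    """Build the PUG REST URL that asks PubChem for compound IDs (hypothetical helper)."""
    if not is_inchikey(inchikey):
        raise ValueError(f"not a well-formed InChI key: {inchikey!r}")
    return f"{PUG_REST}/compound/inchikey/{quote(inchikey)}/cids/JSON"


if __name__ == "__main__":
    # The key reported for the compound discussed in this post.
    print(pubchem_cid_url("JSLVVAICXSKSEQ-UHFFFAOYSA-N"))
```

The layout check only catches malformed keys; it cannot tell whether a key corresponds to any real molecule.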

I suppose the next question might be “Supporting information: chemical graveyard or invaluable resource for chemical spectra?” I confess here that this post was in fact inspired by a previous one on the topic of the provenance of NMR spectra, and perhaps also by the concept of sonification of spectra, in which an instrumental spectrum is converted into a sound signature to give blind people access to such information.^‡ I wonder whether a sonified unique digital signature could be used to search for spectra, somewhat in the manner that the InChI key helped in tracking down (or not) the molecule above? I think it would be reasonable to say that NMR spectra embedded in, say, a 907-page supporting information document are likely to be very much less FAIR than chemical structures.[cite]10.1038/sdata.2016.18[/cite] The solution there of course is better provenance and better metadata, as I previously mulled.

^†There are 262 numbered compounds reported in this article, many with much associated data.

^‡I cannot help but wonder what a carbonyl group sounds like!

Author

Henry Rzepa

Henry Rzepa is Emeritus Professor of Computational Chemistry at Imperial College London.

Tags: Carbon, chemical databases, chemical graveyard, chemical spectra, Chemistry, digital signature, Nature, Organic, Organic chemistry, Organic compound, Organic food, search engines, Technology/Internet

This entry was posted on Friday, March 31st, 2017 at 2:20 pm and is filed under Chemical IT.

3 Responses to “Supporting information: chemical graveyard or invaluable resource for chemical structures.”

Henry Rzepa says:

April 2, 2017 at 10:11 am

I came across this article: https://arxiv.org/abs/1702.00957v1 (it does not have a DOI) entitled “The Elephant in the Room of Density Functional Theory Calculations”. It relates to a very careful estimate of basis-set incompleteness errors for a test set of 211 molecules. This test set is likely to become highly useful for others investigating this phenomenon. So it was of interest to note that this test set has been thoughtfully deposited in a data repository and assigned the DOI: 10.18710/0EM0EL. Whenever I come across such a (potentially) valuable resource, I tend to ask myself “Is it FAIR?”

1. F = Findable. Well, the set of 211 molecules certainly was. But the molecules themselves? Well, no. Their identity is contained in a single downloadable file called geometries.txt. The .txt rather gives it away: a free-form text file which contains 211 entries such as:
```
=====================  Al2_t.xyz  ======================
Atom         x                y                z           
 Al    0.000000000000   0.000000000000  -1.350500000000
 Al    0.000000000000   0.000000000000   1.350500000000
```
But to discover what each molecule is, you have to open the file with a text editor and read the entry, as I did. 211 times!
2. A = Accessible. Again, no problem for the whole dataset, but less so for the individual entries.
3. I = Interoperable. Again, the text file is not expressed in any particularly standard form that would allow a program to automatically retrieve the information. One of my standard tests for (chemical) interoperability is whether Avogadro can open the file. Not this one!
4. R = Re-usable. Well, I don’t know, since the metadata (actually quite extensive) does not declare under what conditions I can re-use the data. I might presume CC0 conditions (no restrictions on re-use) but cannot be certain.
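
To illustrate the interoperability point, here is a minimal sketch of how the free-form file might be split into separate entries in the conventional XYZ layout (atom count, a comment line, then one line per atom). It assumes that every entry in geometries.txt follows the pattern of the single excerpt shown above, which I have not verified for all 211 entries; the function names `split_geometries` and `to_xyz` are hypothetical.

```python
import re

# Header lines look like "=====  Al2_t.xyz  =====" in the excerpt above.
HEADER = re.compile(r"^=+\s+(\S+)\s+=+\s*$")

# The excerpt quoted from geometries.txt, used here as sample input.
SAMPLE = """\
=====================  Al2_t.xyz  ======================
Atom         x                y                z
 Al    0.000000000000   0.000000000000  -1.350500000000
 Al    0.000000000000   0.000000000000   1.350500000000
"""


def split_geometries(text):
    """Map each entry name to a list of (element, x, y, z) tuples."""
    molecules = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = molecules.setdefault(header.group(1), [])
            continue
        fields = line.split()
        if current is None or len(fields) != 4:
            continue  # blank lines or anything before the first header
        try:
            x, y, z = (float(value) for value in fields[1:])
        except ValueError:
            continue  # the "Atom x y z" column-heading line
        current.append((fields[0], x, y, z))
    return molecules


def to_xyz(name, atoms):
    """Render one entry in the conventional XYZ layout."""
    lines = [str(len(atoms)), name]
    lines += [f"{el:<2} {x:18.12f} {y:18.12f} {z:18.12f}" for el, x, y, z in atoms]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for name, atoms in split_geometries(SAMPLE).items():
        print(to_xyz(name, atoms))
```

Each rendered entry could then be written to its own file. This addresses only the file layout; it does nothing to identify the molecules or to clarify the re-use conditions.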

So full marks to this group for making the 211 molecule dataset available. But more work to do to make it “FAIR”.

Henry Rzepa says:

April 17, 2017 at 2:31 pm

I came across this review: scholarlykitchen.sspnet.org/2017/04/11/what-constitutes-peer-review-research-data/ (the URL is self-defining) which goes through the many issues surrounding the peer-review of research data. Nominally, all supporting information in a research paper is supposed to be peer-reviewed as part of the overall review of the article itself. But if you read the above, you will see how complex such a review could get for research data itself. To give you some idea, the review should cover the metadata of the dataset as well as the data itself. All for a process where reviewers give their time and expertise freely (and have to do so within a quite narrow window of time).

One is tempted to conclude that the review of research data is an ideal opportunity for AI (weak or strong artificial intelligence) to allow the process to be automated. I am sure, however, that unintended consequences are likely to emerge, which would have to be factored into the workflows. But a good starting point would be to ensure that supporting information carries not just data, but FAIR data!

Mark Wilkinson says:

April 25, 2017 at 11:21 am

Hi Henry!

I enjoyed reading this very much! You may be interested to know that we (the core FAIR “geeks”) have just today established a working group that will create a set of (self-)assessment metrics for FAIRness. We intend that these can be used by community members to test their own FAIRness, and by other stakeholders (e.g. funding agencies) to evaluate the FAIRness of outputs from their supported projects.

The formal announcement of this working group should happen later today. We are already working, and are consolidating the various FAIR checklists that working group members have already independently created. If you’re interested in the outcomes from this group, please send me a note in a few months and I’ll update you!

Cheers!

Mark


Henry Rzepa's Blog