As data repositories start to flourish, it is reasonable to ask what sort of chemistry can be found there and how one can find it. Here I give an updated[1] worked example of a digital repository search for chemical content and also pose an important issue for the chemistry domain.
Firstly, I should say this search is restricted just to those data repositories that submit indexing terms (metadata) to DataCite, which is the agency that will be used to conduct the searches. Each type of metadata is defined by a prefix or operator field (much in the same way that an advanced Google search can be prefixed with an operator, e.g. author:♥). I will use just two such DataCite field prefixes† here as exemplars (there are many more).
The use of these prefixes is best illustrated by one specific search, which I will dissect here:
https://search.datacite.org/works?query=media:chemical\/x\-gaussian*+SubjectScheme:inchikey+subject:XZYDALXOGPZGNV-UHFFFAOYSA-M+media:chemical\/x\-mnpub*‡
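A query of this shape can also be assembled programmatically. What follows is only a minimal Python sketch under stated assumptions: `build_query_url` is a hypothetical helper (not part of any DataCite library), the base URL is simply the one used above, and the values are assumed to have been escaped already.

```python
# Base URL copied from the example search above.
BASE_URL = "https://search.datacite.org/works"


def build_query_url(terms):
    """Join (field, value) pairs with '+' into a search URL.

    Hypothetical helper for illustration; values must already be escaped.
    """
    query = "+".join(f"{field}:{value}" for field, value in terms)
    return f"{BASE_URL}?query={query}"


# The four restrictions used in the example search.
terms = [
    ("media", r"chemical\/x\-gaussian*"),
    ("SubjectScheme", "inchikey"),
    ("subject", "XZYDALXOGPZGNV-UHFFFAOYSA-M"),
    ("media", r"chemical\/x\-mnpub*"),
]

url = build_query_url(terms)
print(url)
```

Keeping the terms as a list of pairs makes it easy to add or drop a restriction without editing the URL by hand.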
One hit with these restrictions has DOI 10.14469/HPC/2635; clicking the button labelled metadata on the landing page for this object resolves to e.g.
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/2635,
and downloads the metadata record for this object. Part of this record looks a bit like:
This brings me to the important issue for the chemistry domain, which is to agree upon a core set of SubjectSchemes for implementation in data repositories with domain-specific chemical content. The two subjects above, the InChI and the InChIKey, seem obvious candidates for inclusion. But how the list is extended and how the SubjectScheme is specified are now matters for the community to discuss. Perhaps the IUPAC GoldBook is one starting point for the SubjectScheme URIs. Watch this space.
‡ The \ syntax indicates an “escaped” character. Thus in chemical\/x\-gaussian the first \ ensures that the following / is treated as part of the search string, and not as part of the search syntax. Likewise \- ensures the minus character is part of the string and not a syntactic negation. The current list of characters requiring escaping is + - & | ! ( ) { } [ ] ^ " ~ * ? : \ /
† The documentation lists common fields, but there are far more specified in V4 of their schema. The ones you see used here are not (yet?) documented at https://search.datacite.org/help.html
♥ This Google page has a rich plethora of powerful searches, which I suggest almost no-one knows about!
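The escaping rule described in footnote ‡ can be sketched in a few lines of Python. This is an illustrative assumption rather than an official DataCite routine: `escape_term` is a hypothetical name, and the character set is taken directly from the list in that footnote.

```python
# Characters listed in footnote ‡ as requiring escaping.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_term(text):
    """Prefix each special character with a backslash so it is read literally."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


# The trailing * is appended after escaping so that it still acts as a wildcard.
term = "media:" + escape_term("chemical/x-gaussian") + "*"
print(term)  # media:chemical\/x\-gaussian*
```

Only the value is escaped; the colon after the field prefix and the wildcard are added afterwards so that they keep their syntactic meaning.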
Yesterday, a webinar on various aspects of FAIR data was held. Participants were encouraged to leave issues and questions on the topic on a GitHub forum. You can see these at https://github.com/FAIR-Data-EG/consultation/issues.
There will be more such discussions, and if you are interested, do register to participate.