A cascading tutorial on finding rich NMR data using the DataCite data search engine.

In the previous post, I introduced three of a new generation of search engines specialising in the discovery of data. Data has some special features which make its properties slightly different from the conceptual (or natural language) searches we are used to performing for general information and so a search engine specifically for data is invariably going to reflect this. At the simplest level, the data search can retain much of the generic simplicity of a regular search, but to exploit the unique features of data, one really does have to move on to an advanced mode. Here, by introducing a set of search definitions that gradually increase in specificity and power, I hope to convey some of the flavour of one way in which this could be done.

Let me first introduce the search: we want to track down raw NMR FID data for the 11B nucleus associated with the chemical concepts of catalytic amidation.

To construct a search query specific to this set of constraints, one has to understand metadata, and in particular its role in describing data. The structure of that metadata is defined by a specification known as a schema. We are going to exploit one of the better known schemas for describing data, that produced by DataCite[1] (DOI: 10.14454/f2wp-s162). It can be illustrated by just three small metadata components, shown below. These can be expressed in, say, XML, with their properties controlled by the specification in the schema.

  1. <titles>
     16b. 2-((2-aminoethyl)-λ4-azaneyl)-2,4,6-tris(3,4,5-trifluorophenyl)-1,3,5,2,4,6-trioxatriborinan-2-uide
  2. <descriptions>
     <description descriptionType="Other">NMR spectra for 1H, 13C, 19F and 11B nuclei.</description>
  3. <subjects>
     <subject subjectScheme="inchi" schemeURI="http://www.inchi-trust.org/">
     <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">BHYQUOWHUMNGMD-UHFFFAOYSA-N
     <subject subjectScheme="NMR_Nucleus">11B</subject>
     <subject subjectScheme="NMR_Solvent">CDCl3</subject>
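As a rough sketch of how software might read such a record, the snippet below parses a simplified, well-formed version of the fragment above using Python's standard `xml.etree.ElementTree`. The `<resource>` wrapper and the closing tags are assumptions added only to make the fragment parseable; real DataCite XML also declares a namespace, which this sketch ignores. The InChI entry is left out because its value is not shown above.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment (an assumption for illustration only).
FRAGMENT = """
<resource>
  <titles>
    <title>16b. 2-((2-aminoethyl)-λ4-azaneyl)-2,4,6-tris(3,4,5-trifluorophenyl)-1,3,5,2,4,6-trioxatriborinan-2-uide</title>
  </titles>
  <descriptions>
    <description descriptionType="Other">NMR spectra for 1H, 13C, 19F and 11B nuclei.</description>
  </descriptions>
  <subjects>
    <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">BHYQUOWHUMNGMD-UHFFFAOYSA-N</subject>
    <subject subjectScheme="NMR_Nucleus">11B</subject>
    <subject subjectScheme="NMR_Solvent">CDCl3</subject>
  </subjects>
</resource>
"""


def subjects_by_scheme(xml_text):
    """Map each subjectScheme attribute to the text value of its <subject>."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for subject in root.iter("subject"):
        scheme = subject.get("subjectScheme")
        result[scheme] = (subject.text or "").strip()
    return result


subjects = subjects_by_scheme(FRAGMENT)
print(subjects["NMR_Nucleus"], subjects["NMR_Solvent"])
```

The pairing of a scheme name with a value is what the `subjects.subjectScheme` and `subjects.subject` search fields used later rely on.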

The metadata is registered with a store (MDS, DataCite in this instance) in this form and then indexed there. To search that index, we need to learn the query syntax and expression. This is illustrated below for various examples, which can be broken down into components:

  1. The prefix https://commons.datacite.org/?query= is common to all the queries, and hence is only shown for example 1.
  2. The syntax e.g. titles.title: derives from the hierarchy of the metadata, as in 1 above.
  3. Immediately followed by a search string. The * character means the string may be part of a longer string, both preceding and following the actual search string. A literal string would be enclosed in quotes, “…”
  4. Two or more separate queries can be related by a Boolean operator, as +AND+ or +OR+.
  5. The Boolean operations can be grouped using (…) to ensure the logic is unambiguous.
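The five components above can be assembled mechanically. The following is a minimal sketch in Python; the function names (`term`, `any_of`, `all_of`, `title_or_description`) are hypothetical helpers invented for this illustration and are not part of any DataCite tooling. Quoted literal strings are not URL-encoded here, which a real client would probably need to do.

```python
BASE_URL = "https://commons.datacite.org/?query="  # component 1: common prefix


def term(field, text, wildcard=True):
    """Components 2 and 3: a field path followed by a search string."""
    value = f"*{text}*" if wildcard else f'"{text}"'
    return f"{field}:{value}"


def any_of(*clauses):
    """Components 4 and 5: OR the clauses together inside a group."""
    return "(" + "+OR+".join(clauses) + ")"


def all_of(*clauses):
    """Component 4: AND the clauses together."""
    return "+AND+".join(clauses)


def title_or_description(keyword):
    """Allow a keyword to appear in either a title or a description."""
    return any_of(
        term("titles.title", keyword),
        term("descriptions.description", keyword),
    )


query = all_of(title_or_description("amidation"), title_or_description("catalytic"))
print(BASE_URL + query)
```

Grouping each title-or-description pair before joining with AND is what keeps the logic unambiguous, as item 5 requires.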

With the syntax dealt with, we can now proceed to some actual queries. The hits shown were obtained on the day this post was written, and may change with time (hopefully but not necessarily upwards). A brief attempt at a natural language expression of each search appears in the table below, with the Boolean operators written in capitals. Each example is elaborated below to show the logic of its evolution.

Examples 1-9 deal with keywords typically found in either the title or the description metadata fields. Because there are no hard and fast rules as to which of these two any particular keyword might be found in, searches have to be defined which allow both possibilities. Search 2 seeks to find datasets where both keywords are found in a title (or indeed titles, since multiple titles for the same dataset are allowed). Search 3 allows each term to be found in either the title(s) or the description(s) using grouping operators; the difference in hits (28 against 2) shows the necessity of doing this. The search outlined at the top also indicated we specifically wanted NMR data. Searches 4-5 look for this term in the description, or in either the title or the description. We are assuming here that NMR really does relate to spectroscopy and not some other acronym in use by another community. This can be a real problem if the same term has different meanings across different subject areas. In examples 6-9, we turn to boron, since 11B NMR requires a boron compound! Allowing any of the terms to appear in either a title or a description increases the hits compared to more restricted searches.

Time now to restrict the searches even more. The previous searches identified a potential discovery lead (i.e. one we might wish to follow up in more detail). Looking this lead up, we find its molecular formula, a very useful chemical search term. Because this is quite subject specific, we now turn to <subject> rather than <title> or <description>. Search 10 illustrates how this might be done. Search 11 is even more specific; whereas two different chemical species might share a common molecular formula (as isomers), their chemical identifiers (InChI and InChIKey) should be specific to a single species. These latter two can be generated algorithmically for any given compound and so should return information about that specific molecule. Search 12 now combines this search with the 11B nucleus specified in a description, and search 13 generalises it to the title as well.

We are now ready to go to the next level of refinement, that of media types. These are descriptors which identify the type of document in which the data is held. We are all familiar with e.g. .docx as belonging to the Microsoft Word family, originating in early computer operating systems where each document or file name had two components, with the suffix indicating the application (family) likely to be able to process it or the application to be used when the document is double clicked on the desktop. So in search 14, we combine a search of NMR in the title or description with the media type application/zip. We know that Bruker spectrometers export their data in a folder containing about 24 components and this is generally packaged up as a ZIP archive to make it tractable for submission and exchange. We do not know for sure what will be in the ZIP archive, but in combination with the title/description we may be reasonably optimistic (but not certain). However, a ZIP file identified and downloaded by this procedure still has to be accessed in a manner that will recognise any NMR data therein. This function must now be devolved to whatever program is used to access the ZIP file. 
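To illustrate what "devolving to a program" could look like, here is a small sketch that inspects a ZIP archive for folders that look like Bruker acquisitions. It assumes, and this is an assumption rather than something stated above, that a one-dimensional Bruker dataset folder contains a raw data file named `fid` alongside an acquisition parameter file named `acqus`. The function names and the demo archive are invented for illustration.

```python
import io
import zipfile
from pathlib import PurePosixPath


def find_bruker_fids(zip_source):
    """Return archive folders that hold both a 'fid' and an 'acqus' file."""
    with zipfile.ZipFile(zip_source) as archive:
        names = [n for n in archive.namelist() if not n.endswith("/")]
    files_by_folder = {}
    for name in names:
        path = PurePosixPath(name)
        files_by_folder.setdefault(str(path.parent), set()).add(path.name)
    return sorted(
        folder
        for folder, files in files_by_folder.items()
        if {"fid", "acqus"} <= files
    )


def demo_archive():
    """Build a tiny in-memory ZIP standing in for a downloaded dataset."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("sample/1/fid", b"\x00\x01")
        archive.writestr("sample/1/acqus", "##TITLE= demo")
        archive.writestr("sample/1/pdata/1/1r", b"\x00")
        archive.writestr("notes/readme.txt", "not NMR data")
    buffer.seek(0)
    return buffer


print(find_bruker_fids(demo_archive()))
```

A check of this kind only confirms that the expected file names are present; it says nothing about whether the data inside are valid.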

In search 15, we try to be a bit more specific by combining the molecular identifier (InChIKey) with 11B (an NMR-active nucleus) in a title or description and a JCAMP-DX media type. This latter type is more clearly associated with NMR spectroscopic data in JCAMP format, so the expectation is that any hits for this search should provide us with an actual NMR spectrum! There is a slight spanner in the works: we do not yet know whether to expect processed NMR data (i.e. a spectrum) or raw NMR data (i.e. an FID), since JCAMP can hold either, but not both; most examples in fact relate to spectra. Example 16 takes us to a media type which IS known to hold both raw and spectral data concurrently, the Mnova format. But this leads to a new issue. Mnova is commercial software and to use it you need a license. It would indeed be cruel if you managed to find some data, but then had to pay money to view it in its commercial format (although of course that is how some journals operate). Example 17 addresses that problem. The media type is associated not with a data file as such, but with a single-use license file which can be read by Mnova to license the program to read the actual data file. You can now view the data in either FID or spectral form and process the data to your heart’s content. This largely encapsulates the aspiration of the acronym FAIR. We have Found and Accessed the data, Interoperated (i.e. converted an FID to a spectrum) and Re-used it (having checked the re-use license in the metadata) to e.g. analyze the spectrum.
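The four media types discussed so far can be summarised in a small lookup, sketched below. The dictionary and the function name `nmr_with_media` are illustrative inventions; the one-line descriptions simply paraphrase the text above. The function builds the search-14 pattern (NMR in title or description, plus a media type) with the grouping parentheses balanced.

```python
# Media types from searches 14-17 and what each is expected to hold.
MEDIA_TYPES = {
    "application/zip": "archive that may hold a Bruker FID folder",
    "chemical/x-jcamp*": "JCAMP-DX: a spectrum or an FID, not both",
    "chemical/x-mnova*": "Mnova document: raw and spectral data together",
    "chemical/x-mnpub*": "single-use license file readable by Mnova",
}


def nmr_with_media(media_type):
    """Combine a known media type with NMR in either description or title."""
    if media_type not in MEDIA_TYPES:
        raise ValueError(f"unrecognised media type: {media_type}")
    keyword = "(descriptions.description:*NMR*+OR+titles.title:*NMR*)"
    return f"media.media_type:{media_type}+AND+{keyword}"


for media_type, meaning in MEDIA_TYPES.items():
    print(f"{media_type:20s} {meaning}")
print(nmr_with_media("application/zip"))
```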

Example 18 takes us to our final level. Previously the acronym NMR was used as a search term. You might be surprised to learn that it can have up to 33 meanings! In this context, we are interested in only one of them (nuclear magnetic resonance). So rather than imprecisely specify it in a title or a description, we are now going to (also) give it a more precise meaning using <subject>. The exact way in which to do this is still being debated; here is one possibility. Elaborating list item 3 above, we get
<subject subjectScheme="NMR_Nucleus">11B</subject>
which is used to disambiguate from the other 32 possible meanings of NMR. Hence we are interested specifically in the 11B nucleus. We are controlling the data itself to relate to NMR data about that nucleus, using the media type. And example 19 now specifies also that the measurement must be made in a particular solvent. There are of course many other parameters which could be used.

# | Search query | Hits | Plain(er) English
General keywords such as Title and Description
1 | https://commons.datacite.org/?query=titles.title:*amidation* | 161 | Amidation in title.
2 | titles.title:*amidation*+AND+titles.title:*catalytic* | 2 | Amidation AND catalytic in title.
3 | (titles.title:*amidation*+OR+descriptions.description:*amidation*)+AND+(titles.title:*catalytic*+OR+descriptions.description:*catalytic*) | 28 | Amidation in either title OR description AND catalytic in either title OR description.
4 | descriptions.description:*NMR* | 17,978 | NMR in description.
5 | descriptions.description:*NMR*+OR+titles.title:*NMR* | 26,152 | NMR in either title OR description.
6 | titles.title:*boron*+AND+titles.title:*catalysed* | 20 | Boron AND catalysed in title.
7 | titles.title:*boron*+AND+titles.title:*catalysed*+AND+titles.title:*NMR* | 1 | Boron AND catalysed AND NMR in title.
8 | titles.title:*boron*+AND+titles.title:*catalysed*+AND+(titles.title:*NMR*+OR+descriptions.description:*NMR*) | 3 | Boron AND catalysed in title AND NMR in either title OR description.
9 | (titles.title:*boron*+OR+descriptions.description:*boron*)+AND+(titles.title:*catalysed*+OR+descriptions.description:*catalysed*)+AND+(titles.title:*NMR*+OR+descriptions.description:*NMR*) | 6 | Boron AND catalysed AND NMR in either title OR description.
Discovery lead: 10.14469/hpc/2247
Subject keywords
10 | subjects.subjectScheme:inchi+AND+subjects.subject:*C20H14B3F9N2O3* | 1 | Molecular formula in subject.
11 | subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N* | 1 | InChIKey in subject.
12 | subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+descriptions.description:*11B* | 1 | InChIKey in subject AND 11B in description.
13 | subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*) | 1 | InChIKey in subject AND 11B in either description OR title.
Discovery lead: 10.14469/hpc/2365
14 | media.media_type:application/zip+AND+(descriptions.description:*NMR*+OR+titles.title:*NMR*) | 219 | NMR in either title OR description AND media type which might contain (Bruker spectrometer) FID data. As it happens, all 219 ZIP files in this instance do.
15 | media.media_type:chemical/x-jcamp*+AND+subjects.subjectScheme:inchikey+AND+ | 1 | InChIKey in subject AND 11B in either subject OR title AND media type known to contain spectral NMR data (and possibly raw NMR data).
16 | media.media_type:chemical/x-mnova*+AND+subjects.subjectScheme:inchikey+AND+ | 1 | InChIKey in subject AND 11B in either subject OR title AND media type known to contain both raw and spectral data (probably NMR).
17 | media.media_type:chemical/x-mnpub*+AND+subjects.subjectScheme:inchikey+AND+ | 1 | InChIKey in subject AND 11B in either subject OR title AND media type known to contain a license for use of MestreNova.
18 | media.media_type:chemical/x-mnpub*+AND+(subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*)+AND+(subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B) | 1 | InChIKey in subject AND 11B nucleus in subject AND media type known to contain a license for use of MestreNova for the dataset.
19 | media.media_type:chemical/x-mnpub*+AND+(subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*)+AND+(subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B)+AND+(subjects.subjectScheme:NMR_Solvent+AND+subjects.subject:CDCl3) | 1 | InChIKey in subject AND 11B nucleus in subject AND media type known to contain both raw and spectral data AND solvent chloroform in subject.

The searches above are meant to be illustrative and to serve as a tutorial showing one way of constraining a data search to have very specific, in this example chemical, properties. Many of the examples could be tightened up further (thus making them look even more intimidating). Also, some of the precise ways of defining such constraints are still being debated. In the above, I use the definitions found in the schema coupled with the media types property. It would also be possible to e.g. dispense with the media types and achieve the same result using the other properties obtained from the schema. When the dust settles (if it ever does) on this, it is quite possible the searches will look rather different from the above. The purpose here was not to set any standards in stone, but simply to illustrate the potential of searching for data in this manner. Other methods may emerge; the Google dataset search system, for example, does not use the same schema, and so the searches themselves would also look different.

It should also be mentioned that the examples in the table above are not likely, in their present form, to be willingly used by most chemists. These queries are largely formulated in a syntax more suited to machines than to humans. But there is nothing to prevent a more human-friendly “front end” being written that takes the quite complex syntax above and renders it more usable by people. Such a front end could also absorb queries formulated against different schemas and unify them for the user.
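As a purely hypothetical illustration of what the core of such a front end might look like, the sketch below lets a user fill in a handful of plain fields and generates the machine-oriented query from them. The class name `DataSearch`, its fields and the ordering of the clauses are all assumptions made for this example; it targets only the DataCite-style syntax used in this post.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def subject_clause(scheme, value):
    """Group a subject scheme with its value, as in searches 18 and 19."""
    return f"(subjects.subjectScheme:{scheme}+AND+subjects.subject:{value})"


@dataclass
class DataSearch:
    """Human-friendly description of a search; every field is optional."""
    keywords: List[str] = field(default_factory=list)
    inchikey: Optional[str] = None
    nucleus: Optional[str] = None
    solvent: Optional[str] = None
    media_type: Optional[str] = None

    def to_query(self):
        clauses = []
        if self.media_type:
            clauses.append(f"media.media_type:{self.media_type}")
        for kw in self.keywords:
            # Each keyword may appear in either a title or a description.
            clauses.append(
                f"(titles.title:*{kw}*+OR+descriptions.description:*{kw}*)"
            )
        if self.inchikey:
            clauses.append(subject_clause("inchikey", f"*{self.inchikey}*"))
        if self.nucleus:
            clauses.append(subject_clause("NMR_Nucleus", self.nucleus))
        if self.solvent:
            clauses.append(subject_clause("NMR_Solvent", self.solvent))
        return "+AND+".join(clauses)


search = DataSearch(
    inchikey="BHYQUOWHUMNGMD-UHFFFAOYSA-N",
    nucleus="11B",
    solvent="CDCl3",
    media_type="chemical/x-mnpub*",
)
print("https://commons.datacite.org/?query=" + search.to_query())
```

With these inputs the generated string matches search 19 in the table; a fuller front end would presumably also validate the inputs and translate them for other schemas.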

You can see a more complete set here.

Of course, the 11B nucleus can have many properties other than NMR.

Programs such as MestreNova can do this, but you will need a commercial license to process in this way. If there is a media type chemical/x-mnpub also associated with the ZIP file, then this can be used in lieu of such a license key for that dataset only. See examples 17-19.

BagIt is one schema for adding metadata to a container such as ZIP to indicate the contents, albeit with the requirement that the software reading the ZIP file must process this information for it to be of use.

This post has DOI: drrm.

