{"id":22059,"date":"2020-04-11T05:51:13","date_gmt":"2020-04-11T04:51:13","guid":{"rendered":"https:\/\/www.ch.imperial.ac.uk\/rzepa\/blog\/?p=22059"},"modified":"2021-03-02T11:23:44","modified_gmt":"2021-03-02T11:23:44","slug":"a-cascading-tutorial-in-finding-rich-nmr-data-using-the-datacite-datasearch-engine","status":"publish","type":"post","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059","title":{"rendered":"A cascading tutorial in finding rich NMR data using the Datacite datasearch engine."},"content":{"rendered":"<div class=\"kcite-section\" kcite-section-id=\"22059\">\n<p>In the <a href=\"https:\/\/www.ch.imperial.ac.uk\/rzepa\/blog\/?p=22043\">previous post<\/a>, I introduced three of a new generation of search engines specialising in the discovery of data. Data has some special features which make its properties slightly different from the conceptual (or natural language) searches we are used to performing for general information and so a search engine specifically for data is invariably going to reflect this. At the simplest level, the data search can retain much of the generic simplicity of a regular search, but to exploit the unique features of data, one really does have to move on to an advanced mode. Here, by introducing a set of search definitions that gradually increase in specificity and power, I hope to convey some of the flavour of one way in which this could be done.<\/p>\n<hr style=\"border-color: blue;\" \/>\n<p>Let me first introduce the search: we want to track down <strong>raw NMR FID data<\/strong> for the <strong><sup>11<\/sup>B<\/strong> nucleus associated with the chemical concepts of <strong>catalytic amidation.<\/strong><\/p>\n<hr style=\"border-color: blue;\" \/>\n<p>To understand how to construct a search query which is specific to this set of constraints, one has to understand metadata and in particular its context of describing data. This is done <em>via<\/em> a specification known as a schema. 
We are going to exploit one of the better known schemas for describing data, that produced by DataCite<span id=\"cite_ITEM-22059-0\" name=\"citation\"><a href=\"#ITEM-22059-0\">[1]<\/a><\/span> (DOI: <a href=\"https:\/\/doi.org\/10.14454\/f2wp-s162\">10.14454\/f2wp-s162<\/a>). It can be illustrated by just three small metadata components, which can be implemented in say an XML language and the properties controlled by their specification in the schema and shown below, with the actual value of the metadata highlighted in red.<sup>\u2020<\/sup><\/p>\n<ol>\n<li>\n<pre><span style=\"font-size: 7pt;\"><small><tt>&lt;titles&gt;\r\n &lt;title&gt;\r\n <span style=\"color: #ff0000;\"><strong>16b. 2-((2-aminoethyl)-\u03bb4-azaneyl)-2,4,6-tris(3,4,5-trifluorophenyl)-1,3,5,2,4,6-trioxatriborinan-2-uide<\/strong><\/span>\r\n &lt;\/title&gt;\r\n&lt;\/titles&gt;<\/tt><\/small>\r\n<\/span><\/pre>\n<\/li>\n<li>\n<pre><span style=\"font-size: 7pt;\"><small><tt>&lt;descriptions&gt;\r\n &lt;description descriptionType=\"Other\"&gt;<span style=\"color: #ff0000;\"><strong>NMR spectra for 1H, 13C, 19F and 11B nuclei<\/strong><\/span>.&lt;\/description&gt;\r\n&lt;\/descriptions&gt;<\/tt><\/small>\r\n<\/span><\/pre>\n<\/li>\n<li>\n<pre><span style=\"font-size: 7pt;\"><small><tt>&lt;subjects&gt;\r\n<\/tt><\/small><small><tt> &lt;subject subjectScheme=\"inchi\" schemeURI=\"http:\/\/www.inchi-trust.org\/\"&gt;<span style=\"color: #ff0000;\">\r\n <strong>InChI=1S\/C20H14B3F9N2O3\/c24-12-3-9(4-13(25)18(12)30)21-35-22(10-5-14(26)19(31)15(27)6-10)37-23(36-21,34-2-1-33)11-7-16(28)20(32)17(29)8-11\/h3-8H,1-2,33-34H2\/q-1<\/strong><\/span>\r\n  &lt;\/subject&gt;\r\n &lt;subject subjectScheme=\"inchikey\" schemeURI=\"http:\/\/www.inchi-trust.org\/\"&gt;<span style=\"color: #ff0000;\"><strong>BHYQUOWHUMNGMD-UHFFFAOYSA-N<\/strong><\/span>\r\n  &lt;\/subject&gt;\r\n &lt;subject subjectScheme=\"NMR_Nucleus\"&gt;<span style=\"color: #ff0000;\"><strong>11<\/strong>B<\/span>&lt;\/subject&gt;\r\n &lt;subject 
subjectScheme=\"NMR_Solvent\"&gt;<span style=\"color: #ff0000;\"><strong>CDCl3<\/strong><\/span>&lt;\/subject&gt;\r\n&lt;\/subjects&gt;<\/tt><\/small>\r\n<\/span><\/pre>\n<\/li>\n<\/ol>\n<p>The metadata is registered with a store (MDS, DataCite in this instance) in this form and then indexed there. To search that index, we need to learn the query syntax and expression. This is illustrated below for various examples, which can be broken down into components:<\/p>\n<ol start=\"4\">\n<li>The prefix <b><tt>https:\/\/commons.datacite.org\/?query=<\/tt><\/b> is common to all the queries, and hence is only shown for example 1.<\/li>\n<li>The syntax <em>e.g.<\/em> <small><b><tt>titles.title:<\/tt><\/b><\/small> derives from the hierarchy of the metadata, as in <b>1<\/b> above.<\/li>\n<li>Immediately followed by a search string. The * character means the string may be part of a longer string, both preceding and following the actual search string. A literal string would be enclosed in quotes, &#8220;&#8230;&#8221;<\/li>\n<li>Two or more separate queries can be related by a Boolean operator, as <b><tt>+AND+<\/tt><\/b> or\u00a0<b><tt>+OR+<\/tt><\/b>.<\/li>\n<li>The Boolean operations can be grouped using (&#8230;) to ensure the logic is unambiguous.<\/li>\n<\/ol>\n<p>With the syntax dealt with, we can now proceed to some actual queries. The hits shown were obtained on the day this post was written, and may change with time (hopefully but not necessarily upwards). A brief attempt at a natural language expression of each search appears in the table below, with the Boolean operators indicated in red. Each example is elaborated below to show the logic of their evolution.<\/p>\n<p><strong>Examples 1-9<\/strong> deal with keywords typically found in either the title or the description metadata fields. 
Because there are no hard and fast rules as to which of these two any particular keyword might be found in, searches have to be defined which allow both possibilities.\u00a0<strong>Search 2<\/strong> seeks to find datasets where <strong>both<\/strong> keywords are found in a title (or indeed titles, since multiple titles for the same dataset are allowed). <strong>Search 3<\/strong> allows each term to be found in either the title(s) or the description(s) using grouping operators; the difference in hits shows the necessity of doing this. The search outlined at top also indicated we specifically wanted <strong>NMR<\/strong> data. <strong>Searches 4-5<\/strong>\u00a0search for this term in either the title or the description. We are now assuming that NMR really does relate to spectroscopy and not some other acronym in use by another community. This can be a real problem if the same term has different meanings across different subject areas. In <strong>examples 6-9<\/strong>, we now turn to boron, since <sup>11<\/sup>B NMR requires a boron compound! Allowing any of the terms to appear either as a title or a description increases the hits compared to more restricted searches.<\/p>\n<p>Time now to restrict the searches even more. In the previous searches, we had identified a potential discovery lead (<em>i.e.<\/em> one we might wish to follow up in more detail). Looking this lead up, we find its molecular formula, a very useful chemical search term. Because this is quite subject specific, we now turn to <strong>&lt;subject&gt;<\/strong> rather than <strong>&lt;title&gt;<\/strong> or <strong>&lt;description&gt;<\/strong>. <strong>Search 10<\/strong> illustrates how this might be done. <strong>Search 11<\/strong> is even more specific; whereas it is possible that two different chemical species might share a common molecular formula (as isomers), their chemical identifiers (InChI and InChIKey) should come much closer to being unique. 
These latter two can be generated algorithmically for any given compound and so should return information about that specific molecule. <strong>Search 12<\/strong> now combines this search with the <sup>11<\/sup>B nucleus specified in a description, and <strong>search 13<\/strong> generalises it to the title as well.<\/p>\n<p>We are now ready to go to the next level of refinement, that of <strong>media types<\/strong>. These are descriptors which identify the type of document in which the data is held. We are all familiar with <em>e.g.<\/em> <strong>.docx<\/strong> as belonging to the Microsoft Word family. The convention originates in early computer operating systems, where each document or file name had two components and the suffix indicated the application (family) likely to be able to process it, or the application to be used when the document is double-clicked on the desktop. So in <strong>search 14<\/strong>, we combine a search for NMR in the title or description with the media type application\/zip. We know that Bruker spectrometers export their data in a folder containing about 24 components and this is generally packaged up as a ZIP archive to make it tractable for submission and exchange. We do not know for sure what will be in the ZIP archive,<sup>\u2660<\/sup> but in combination with the title\/description we may be reasonably optimistic (but not certain). However, a ZIP file identified and downloaded by this procedure still has to be accessed in a manner that will recognise any NMR data therein. This function must now be devolved to whatever program is used to access the ZIP file.<sup><span style=\"color: #ff0000;\">\u2665<\/span><\/sup>\u00a0<\/p>\n<p>In\u00a0<strong>search 15<\/strong>, we try to be a bit more specific by combining the molecular identifier (InChIKey) with\u00a0<sup>11<\/sup>B (an NMR active nucleus) in a title or description and a JCAMP-DX media type. 
This latter type is more clearly associated with NMR spectroscopic data in JCAMP format, so the expectation is that any hits for this search sequence should provide us with an actual NMR spectrum! There is a slight spanner in the works; we do not yet know whether to expect processed NMR data (<em>i.e.<\/em> a spectrum) or raw NMR data (<em>i.e.<\/em> an FID), since JCAMP can hold either but not both (most examples in fact relate to spectra). <strong>Example 16<\/strong>\u00a0takes us to a media type which IS known to hold both raw and spectral data concurrently, the Mnova format. But this again leads to a new issue. Mnova is commercial software and to use it you need a license.<span style=\"color: #ff0000;\"><sup>\u2665<\/sup><\/span> It would indeed be cruel if you managed to find some data, but then had to pay money to view it in its commercial format (although of course that is how some journals operate). <strong>Example 17<\/strong>\u00a0addresses that problem. The media type is associated not with a data file as such, but with a single-use license file which can be read by Mnova to license the program to read the actual data file. You can now view the data in either FID or spectral form and process the data to your heart&#8217;s content. This largely encapsulates the aspiration of the acronym <strong>FAIR<\/strong>.\u00a0We have <strong>F<\/strong>ound and <strong>A<\/strong>ccessed the data, <strong>I<\/strong>nteroperated (<em>i.e.<\/em> converted an FID to a spectrum) and <strong>R<\/strong>e-used it (having checked the re-use license in the metadata) to <em>e.g.<\/em> analyze the spectrum.<\/p>\n<p><strong>Example 18<\/strong>\u00a0takes us to our final level. Previously the acronym <strong>NMR<\/strong> was used as a search term. You might be surprised to learn that it can have up to <a href=\"https:\/\/www.acronymfinder.com\/NMR.html\">33 meanings<\/a>! In this context, we are interested in only one of them (nuclear magnetic resonance). 
So rather than imprecisely specify it in a title or a description, we are now going to (also) give it a more precise meaning using &lt;subject&gt;. The exact way in which to do this is still being debated; here is one possibility. Elaborating list item 3 above, we get<br \/>\n<tt>subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B<\/tt><br \/>\nwhich is used to disambiguate from the other 32 possible meanings of NMR. Hence we are interested specifically in the <sup>11<\/sup>B nucleus.<sup>\u2021<\/sup> We are controlling the data itself to relate to NMR data about that nucleus, using the media type. And <strong>example 19<\/strong>\u00a0now specifies also that the measurement must be made in a particular solvent. There are of course many other parameters which could be used.<\/p>\n<table border=\"1\">\n<tbody>\n<tr>\n<th><span style=\"font-size: 10pt;\">#<\/span><\/th>\n<th style=\"max-width: 350px;\"><span style=\"font-size: 10pt;\">Search query<\/span><\/th>\n<th><span style=\"font-size: 10pt;\">Hits<\/span><\/th>\n<th><span style=\"font-size: 10pt;\">Plain(er) English<br \/>\ndescription<\/span><\/th>\n<\/tr>\n<tr>\n<th colspan=\"4\"><span style=\"font-size: 10pt;\">General keywords such as Title and Description<\/span><\/th>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">1<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=titles.title:*amidation*\" target=\"references\" rel=\"noopener noreferrer\">https:\/\/commons.datacite.org\/?query=titles.title:*amidation*<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">161<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">Amidation in title.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">2<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=titles.title:*amidation*+AND+titles.title:*catalytic*\" target=\"references\" rel=\"noopener 
noreferrer\">titles.title:*amidation*+AND+titles.title:*catalytic*<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">2<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">Amidation <b style=\"color: red;\">AND<\/b> catalytic in title.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">3<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=(titles.title:*amidation*+OR+descriptions.description:*amidation*)+AND+(titles.title:*catalytic*+OR+descriptions.description:*catalytic*)\" target=\"references\" rel=\"noopener noreferrer\">(titles.title:*amidation*+OR+descriptions.description:*amidation*)+AND+(titles.title:*catalytic*+OR+descriptions.description:*catalytic*)<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">28<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">Amidation in either title <b style=\"color: red;\">OR<\/b> description <b style=\"color: red;\">AND<\/b> Catalytic in either title <b style=\"color: red;\">OR<\/b> description.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">4<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=descriptions.description:*NMR*\" target=\"references\" rel=\"noopener noreferrer\">descriptions.description:*NMR*<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">17,978<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">NMR in description<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">5<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=descriptions.description:*NMR*+OR+titles.title:*NMR*\" target=\"references\" rel=\"noopener noreferrer\">descriptions.description:*NMR*+OR+titles.title:*NMR*<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">26,152<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">NMR in either title <b style=\"color: 
red;\">OR<\/b> description.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">6<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=titles.title:*boron*+AND+titles.title:*catalysed*\" target=\"references\" rel=\"noopener noreferrer\">titles.title:*boron*+AND+titles.title:*catalysed*<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">20<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">Boron <b style=\"color: red;\">AND<\/b> Catalysed in title.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">7<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=titles.title:*boron*+AND+titles.title:*catalysed*+AND+titles.title:*NMR*\" target=\"references\" rel=\"noopener noreferrer\">titles.title:*boron*+AND+titles.title:*catalysed*+AND+titles.title:*NMR*<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">1<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">Boron <b style=\"color: red;\">AND<\/b> Catalysed <b style=\"color: red;\">AND<\/b> NMR in title.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">8<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=titles.title:*boron*+AND+titles.title:*catalysed*+AND+(titles.title:*NMR*+OR+descriptions.description:*NMR*)\" target=\"references\" rel=\"noopener noreferrer\">titles.title:*boron*+AND+titles.title:*catalysed*+AND+(titles.title:*NMR*+OR+descriptions.description:*NMR*)<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">3<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">Boron <b style=\"color: red;\">AND<\/b> Catalysed in Title and NMR in either title <b style=\"color: red;\">OR<\/b> description.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">9<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a 
href=\"https:\/\/commons.datacite.org\/?query=(titles.title:*boron*+OR+descriptions.description:*boron*)+AND+(titles.title:*catalysed*+OR+descriptions.description:*catalysed*)+AND+(titles.title:*NMR*+OR+descriptions.description:*NMR*)\" target=\"references\" rel=\"noopener noreferrer\">(titles.title:*boron*+OR+descriptions.description:*boron*)+AND+(titles.title:*catalysed*+OR+descriptions.description:*catalysed*)+AND+(titles.title:*NMR*+OR+descriptions.description:*NMR*)<\/a><\/span><\/td>\n<td style=\"height: 88px;\"><span style=\"font-size: 7pt;\">6<\/span><\/td>\n<td style=\"height: 88px;\"><span style=\"font-size: 7pt;\">Boron <b style=\"color: red;\">AND<\/b> Catalysed <b style=\"color: red;\">AND<\/b> NMR in either title <b style=\"color: red;\">OR<\/b> description.<\/span><\/td>\n<\/tr>\n<tr>\n<td colspan=\"4\"><span style=\"font-size: 10pt; text-align: center;\">Discovery lead: <a href=\"https:\/\/doi.org\/10.14469\/hpc\/2247\" target=\"references\" rel=\"noopener noreferrer\">10.14469\/hpc\/2247<\/a><\/span><\/td>\n<\/tr>\n<tr>\n<th colspan=\"4\"><span style=\"font-size: 10pt;\">Subject keywords<\/span><\/th>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">10<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=subjects.subjectScheme:inchi+AND+subjects.subject:*C20H14B3F9N2O3*\" target=\"references\" rel=\"noopener noreferrer\">subjects.subjectScheme:inchi+AND+subjects.subject:*C20H14B3F9N2O3*<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">1<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">Molecular formula in subject.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">11<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*\" target=\"references\" rel=\"noopener 
noreferrer\">subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">1<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">InChIKey in subject.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">12<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+descriptions.description:*11B*\" target=\"references\" rel=\"noopener noreferrer\">subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+descriptions.description:*11B*<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">1<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">InChIKey in subject <b style=\"color: red;\">AND<\/b> <sup>11<\/sup>B in description.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">13<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*)\" target=\"references\" rel=\"noopener noreferrer\">subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*)<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">1<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">InChIKey in subject <b style=\"color: red;\">AND<\/b> <sup>11<\/sup>B in either description <b style=\"color: red;\">OR<\/b> title.<\/span><\/td>\n<\/tr>\n<tr>\n<td colspan=\"4\"><span style=\"font-size: 10pt;\">Discovery lead: <a href=\"https:\/\/doi.org\/10.14469\/hpc\/2365\" target=\"references\" rel=\"noopener noreferrer\">10.14469\/hpc\/2365<\/a><\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">14<\/span><\/td>\n<td style=\"max-width: 
350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=media.media_type:application\/zip+AND+(descriptions.description:*NMR*+OR+titles.title:*NMR*)\" target=\"references\" rel=\"noopener noreferrer\">media.media_type:application\/zip+AND+(descriptions.description:*NMR*+OR+titles.title:*NMR*)<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">219<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">NMR in either title <b style=\"color: red;\">OR<\/b> description <strong><span style=\"color: #ff0000;\">AND<\/span><\/strong> media type which might contain (Bruker spectrometer) FID data. As it happens, all 219 ZIP files in this instance do.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">15<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=media.media_type:chemical\/x-jcamp*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*)\" target=\"references\" rel=\"noopener noreferrer\">media.media_type:chemical\/x-jcamp*+AND+subjects.subjectScheme:inchikey+AND+<br \/>\nsubjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*)<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">1<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">InChIKey in subject <b style=\"color: red;\">AND<\/b> <sup>11<\/sup>B in either description <b style=\"color: red;\">OR<\/b> title <b style=\"color: red;\">AND<\/b> media type known to contain spectral NMR data (and possibly raw NMR data).<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">16<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a 
href=\"https:\/\/commons.datacite.org\/?query=media.media_type:chemical\/x-mnova*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*)\" target=\"references\" rel=\"noopener noreferrer\">media.media_type:chemical\/x-mnova*+AND+subjects.subjectScheme:inchikey+AND+<br \/>\nsubjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*)<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">1<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">InChIKey in subject <b style=\"color: red;\">AND<\/b> <sup>11<\/sup>B in either description <b style=\"color: red;\">OR<\/b> title <b style=\"color: red;\">AND<\/b> media type known to contain both raw and spectral data (probably NMR).<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">17<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=media.media_type:chemical\/x-mnpub*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*)\" target=\"references\" rel=\"noopener noreferrer\">media.media_type:chemical\/x-mnpub*+AND+subjects.subjectScheme:inchikey+AND+<br \/>\nsubjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*)<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">1<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">InChIKey in subject <b style=\"color: red;\">AND<\/b> <sup>11<\/sup>B in either description <b style=\"color: red;\">OR<\/b> title <b style=\"color: red;\">AND<\/b> media type known to contain a license for use of MestreNova.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">18<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a 
href=\"https:\/\/commons.datacite.org\/?query=media.media_type:chemical\/x-mnpub*+AND+(subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*)+AND+(subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B)\" target=\"references\" rel=\"noopener noreferrer\">media.media_type:chemical\/x-mnpub*+AND+(subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*)+AND+(subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B)<\/a><\/span><\/td>\n<td style=\"height: 159px;\"><span style=\"font-size: 7pt;\">1<\/span><\/td>\n<td style=\"height: 159px;\"><span style=\"font-size: 7pt;\">InChIKey in subject <b style=\"color: red;\">AND<\/b> <sup>11<\/sup>B nucleus in subject <b style=\"color: red;\">AND<\/b> media type known to contain a license for use of MestreNova for the dataset.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-size: 7pt;\">19<\/span><\/td>\n<td style=\"max-width: 350pt;\"><span style=\"font-size: 7pt;\"><a href=\"https:\/\/commons.datacite.org\/?query=media.media_type:chemical\/x-mnova*+AND+(subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*)+AND+(subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B)+AND+(subjects.subjectScheme:NMR_Solvent+AND+subjects.subject:CDCl3)\" target=\"references\" rel=\"noopener noreferrer\">media.media_type:chemical\/x-mnova*+AND+(subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*)+AND+(subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B)+AND+(subjects.subjectScheme:NMR_Solvent+AND+subjects.subject:CDCl3)<\/a><\/span><\/td>\n<td><span style=\"font-size: 7pt;\">1<\/span><\/td>\n<td><span style=\"font-size: 7pt;\">InChIKey in subject <b style=\"color: red;\">AND<\/b> <sup>11<\/sup>B nucleus in subject <b style=\"color: red;\">AND<\/b> media type known to contain both raw and spectral data <b style=\"color: red;\">AND<\/b> solvent chloroform in 
subject.<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The searches above are meant to be illustrative and to serve as a tutorial showing one way of constraining a data search to have very specific (in this example, chemical) properties. Many of the examples could be tightened up further (thus making them look even more intimidating). Also, some of the precise ways of defining such constraints are still being debated. In the above, I use the definitions found in the schema together with the media types property. It would also be possible to <em>e.g.<\/em> dispense with the media types and achieve this using the other properties obtained from the schema. When the dust settles (if it ever does) on this, it is quite possible the searches will look rather different from the above. The purpose here was not to set any standards in stone, but simply to illustrate the potential of searching for data in this manner. Other methods may emerge; the Google dataset search system, for example, does not use the same schema and so the searches themselves would also look different.<\/p>\n<p>It should also be mentioned that the examples in the table above are not likely, in their present form, to be willingly used by most chemists. These queries are largely formulated in a syntax more suited to machines than to humans. But there is nothing to prevent a more human-friendly &#8220;front end&#8221; being written that takes the quite complex syntax above and renders it more usable by people. Such a front end could also absorb queries formulated against different schemas and unify them for the user.<\/p>\n<hr \/>\n<p><small><sup>\u2020<\/sup>You can see a more complete set <a href=\"https:\/\/data.datacite.org\/application\/vnd.datacite.datacite+xml\/10.14469\/hpc\/2365\">here<\/a>. <sup>\u2021<\/sup>Of course, the <sup>11<\/sup>B nucleus can have many properties other than NMR. 
<span style=\"color: #ff0000;\"><sup>\u2665<\/sup><\/span>Programs such as MestreNova can do this, but you will need a commercial license to process in this way. If there is a media type chemical\/x-mnpub also associated with the ZIP file, then this can be used in lieu of such a license key for that dataset only. See examples 17-19. <sup>\u2660<\/sup><a href=\"https:\/\/en.wikipedia.org\/wiki\/BagIt\" target=\"_blank\" rel=\"noopener noreferrer\">Bagit<\/a> is one schema for adding metadata to a container such as ZIP to indicate the contents, albeit with the requirement that the software reading the ZIP file must process this information for it to be of\u00a0use.<\/small> This post has DOI: <a href=\"https:\/\/doi.org\/drrm\">drrm<\/a>.<\/p>\n<h2>References<\/h2>\n    <ol class=\"kcite-bibliography csl-bib-body\"><li id=\"ITEM-22059-0\">DataCite Metadata Working Group., \"DataCite Metadata Schema for the Publication and Citation of Research Data v4.3\", <i>DataCite<\/i>, 2019. <a href=\"https:\/\/doi.org\/10.14454\/f2wp-s162\">https:\/\/doi.org\/10.14454\/f2wp-s162<\/a>\n\n<\/li>\n<\/ol>\n\n<\/div> <!-- kcite-section 22059 -->","protected":false},"excerpt":{"rendered":"<p>In the previous post, I introduced three of a new generation of search engines specialising in the discovery of data. 
Data has some special features which make its properties slightly different from the conceptual (or natural language) searches we are used to performing for general information and so a search engine specifically for data is [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":5,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[2],"tags":[],"ppma_author":[2661],"class_list":["post-22059","post","type-post","status-publish","format-standard","hentry","category-chemical-it"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>A cascading tutorial in finding rich NMR data using the Datacite datasearch engine. - Henry Rzepa&#039;s Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A cascading tutorial in finding rich NMR data using the Datacite datasearch engine. 
- Henry Rzepa&#039;s Blog\" \/>\n<meta property=\"og:description\" content=\"In the previous post, I introduced three of a new generation of search engines specialising in the discovery of data. Data has some special features which make its properties slightly different from the conceptual (or natural language) searches we are used to performing for general information and so a search engine specifically for data is [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059\" \/>\n<meta property=\"og:site_name\" content=\"Henry Rzepa&#039;s Blog\" \/>\n<meta property=\"article:published_time\" content=\"2020-04-11T04:51:13+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2021-03-02T11:23:44+00:00\" \/>\n<meta name=\"author\" content=\"Henry Rzepa\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Henry Rzepa\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"A cascading tutorial in finding rich NMR data using the Datacite datasearch engine. - Henry Rzepa&#039;s Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059","og_locale":"en_GB","og_type":"article","og_title":"A cascading tutorial in finding rich NMR data using the Datacite datasearch engine. - Henry Rzepa&#039;s Blog","og_description":"In the previous post, I introduced three of a new generation of search engines specialising in the discovery of data. 
Data has some special features which make its properties slightly different from the conceptual (or natural language) searches we are used to performing for general information and so a search engine specifically for data is [&hellip;]","og_url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059","og_site_name":"Henry Rzepa&#039;s Blog","article_published_time":"2020-04-11T04:51:13+00:00","article_modified_time":"2021-03-02T11:23:44+00:00","author":"Henry Rzepa","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Henry Rzepa","Estimated reading time":"13 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059#article","isPartOf":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059"},"author":{"name":"Henry Rzepa","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#\/schema\/person\/2b40f7b9c872a4dc1547e040a11b6281"},"headline":"A cascading tutorial in finding rich NMR data using the Datacite datasearch engine.","datePublished":"2020-04-11T04:51:13+00:00","dateModified":"2021-03-02T11:23:44+00:00","mainEntityOfPage":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059"},"wordCount":2394,"commentCount":0,"articleSection":["Chemical IT"],"inLanguage":"en-GB","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059","url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059","name":"A cascading tutorial in finding rich NMR data using the Datacite datasearch engine. 
- Henry Rzepa&#039;s Blog","isPartOf":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#website"},"datePublished":"2020-04-11T04:51:13+00:00","dateModified":"2021-03-02T11:23:44+00:00","author":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#\/schema\/person\/2b40f7b9c872a4dc1547e040a11b6281"},"breadcrumb":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=22059#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog"},{"@type":"ListItem","position":2,"name":"A cascading tutorial in finding rich NMR data using the Datacite datasearch engine."}]},{"@type":"WebSite","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#website","url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/","name":"Henry Rzepa&#039;s Blog","description":"Chemistry with a twist","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-GB"},{"@type":"Person","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#\/schema\/person\/2b40f7b9c872a4dc1547e040a11b6281","name":"Henry Rzepa","image":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g370be3a7397865e4fd161aefeb0a5a85","url":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g","caption":"Henry Rzepa"},"description":"Henry Rzepa is 
Emeritus Professor of Computational Chemistry at Imperial College London.","sameAs":["https:\/\/orcid.org\/0000-0002-8635-8390"],"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?author=1"}]}},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pDef7-5JN","jetpack-related-posts":[],"jetpack_likes_enabled":false,"authors":[{"term_id":2661,"user_id":1,"is_guest":0,"slug":"admin","display_name":"Henry Rzepa","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts\/22059","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=22059"}],"version-history":[{"count":88,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts\/22059\/revisions"}],"predecessor-version":[{"id":23404,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts\/22059\/revisions\/23404"}],"wp:attachment":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=22059"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=22059"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=
22059"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=22059"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}