A comparison of searches based on metadata records from three (update: five) research repositories.

In the previous blog post, I looked at the metadata records registered with DataCite for some chemical computational modelling files as published in three different repositories. Here I take it one stage further, by looking at how searches of the DataCite metadata store for three particular values of the metadata associated with this dataset compare.

Search 1: The metadata value -1705.490787 is the Gibbs free energy computed for the molecule associated with the dataset, a molecule which featured in this blog post. https://commons.datacite.org/?query=*\-170* is an un-fielded search for the truncated string -170* (where * is a wildcard character and \ is said to “escape” the minus sign, since an unescaped minus can also act as a Boolean NOT operator), resulting in 70,918 works matching the query. From what we know about the dataset in question, almost all of these are false positives. How can we reduce them?
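The escaping rule described above can be sketched in Python. This is only an illustration and not DataCite's own tooling: the helper names escape_leading_minus and commons_url are hypothetical, and the sketch builds the URL string without sending any request.

```python
from urllib.parse import quote

# Base URL copied from the searches quoted in this post.
COMMONS = "https://commons.datacite.org/?query="


def escape_leading_minus(term: str) -> str:
    """Prefix a leading '-' with a backslash so it is not parsed as NOT."""
    return "\\" + term if term.startswith("-") else term


def commons_url(term: str, leading_wildcard: bool = False) -> str:
    """Build a truncated (wildcard) search URL for a literal term (hypothetical helper)."""
    pattern = ("*" if leading_wildcard else "") + escape_leading_minus(term) + "*"
    # Leave '*', '\' and ':' unencoded, matching the style of the URLs in the post.
    return COMMONS + quote(pattern, safe="*\\:")


print(commons_url("-170", leading_wildcard=True))
```

Only a leading minus is escaped here, because that is the only case the post describes; whether other minus signs ever need escaping is not something this sketch settles.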

Search 1a: https://commons.datacite.org/?query=subjects.subject:\-170* is a fielded search, specifying that the string must occur in the subject field (62 works) but this still has 57 false positives.

Search 1b: https://commons.datacite.org/?query=subjects.subject:\-1705.490787* (in fact a precision of -1705.4* is also sufficient) removes all the false positives (5 works). But are there any false negatives? In fact, for other reasons, we know that there are two works in the Figshare repository where the value -1705.490787 appears among the keyword items on the landing page (e.g. 10.6084/m9.figshare.16685497). There it is indexed and searchable locally, but it does not appear in the registered metadata, and hence these works are not included in the results of the above searches.

Search 2: A further, formally much stronger constraint on the search is https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-1705.490787* whereby a subjectScheme is added to search 1b, constrained to the value Gibbs_Energy. This now returns 3 works, two fewer than search 1b. These are two further false negatives because, as noted previously, the subjectScheme term is not defined in the Zenodo repository metadata record, where the missing two items are located.

Search 2a: https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:*1705.490787*+AND+subjects.schemeUri:*goldbook* is even further constrained to specify a Gibbs_Energy according to the IUPAC Gold Book definition.

Search 2b: https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:*1705.490787*+AND+subjects.schemeUri:*goldbook*+AND+subjects.valueUri:*gaussian* is the highest level of constraint, implying not only that the term Gibbs_Energy is specified by the IUPAC Gold Book definition, but also that its value was determined by (in this example) the Gaussian implementation.
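The way searches 2, 2a and 2b add one constraint at a time can be sketched as a list of field and pattern pairs joined with AND. The function name build_query is hypothetical; the field names and patterns are exactly those in the URLs above, and nothing here contacts DataCite.

```python
# Base URL copied from the searches quoted in this post.
COMMONS = "https://commons.datacite.org/?query="


def build_query(clauses):
    """Join (field, pattern) pairs into one fielded query, combined with AND."""
    return "+AND+".join(f"{field}:{pattern}" for field, pattern in clauses)


SCHEME = ("subjects.subjectScheme", "Gibbs_Energy")

# Search 2: the scheme plus the (escaped, truncated) value.
SEARCH_2 = [SCHEME, ("subjects.subject", "\\-1705.490787*")]

# Search 2a: additionally require the Gold Book as the scheme authority.
SEARCH_2A = [
    SCHEME,
    ("subjects.subject", "*1705.490787*"),
    ("subjects.schemeUri", "*goldbook*"),
]

# Search 2b: additionally require the value to be attributed to Gaussian.
SEARCH_2B = SEARCH_2A + [("subjects.valueUri", "*gaussian*")]

for name, clauses in [("2", SEARCH_2), ("2a", SEARCH_2A), ("2b", SEARCH_2B)]:
    print(f"Search {name}: {COMMONS}{build_query(clauses)}")
```

Keeping the clauses as data makes it easy to see which constraint a given false negative fails on, by dropping one pair at a time.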

So to summarise what we have thus far established, we can successfully eliminate false positives by specifying a fielded search with a requirement that the field specifically relates to Gibbs_Energy. But because of omissions in the metadata records, we also have four false negatives resulting from doing this.
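The counts quoted so far can be tallied in a few lines. This is a rough sketch under two assumptions that are inferred from the post rather than stated in it: that there are 7 relevant works in total (the 5 found by search 1b plus the 2 Figshare works whose value is missing from the registered metadata), and that a search returning 5 or fewer works returns only relevant ones.

```python
# Number of works returned by each search, as reported in the post.
RETURNED = {"1a": 62, "1b": 5, "2": 3}

# Assumption inferred from the post: 5 relevant works are reachable through
# the registered metadata, and 2 more (Figshare) are not.
RELEVANT_IN_METADATA = 5
RELEVANT_TOTAL = RELEVANT_IN_METADATA + 2


def tally(returned: int) -> dict:
    """Estimate false positives/negatives for a search returning `returned` works."""
    # Assumption: the relevant works are returned first, up to the 5 available.
    true_positives = min(returned, RELEVANT_IN_METADATA)
    return {
        "false_positives": returned - true_positives,
        "false_negatives": RELEVANT_TOTAL - true_positives,
        "precision": true_positives / returned,
        "recall": true_positives / RELEVANT_TOTAL,
    }


for name, count in RETURNED.items():
    print(name, tally(count))
```

Under these assumptions the tally reproduces the 57 false positives of search 1a and the four false negatives of search 2.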

Search 3: https://commons.datacite.org/?query=subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N searches for another subject term, the InChIKey for the molecule relating to the data (5 works). Here again, however, context for the string VELNVPXNOKVVTC-VJKZSTDTSA-N is missing, although again the string is long enough to ensure it is unique. But we could go one step further.

Search 4: https://commons.datacite.org/?query=subjects.subjectScheme:inchikey+AND+subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N constrains the subject term to only those strings describing an InChIKey (3 works). The two works missing relative to search 3 are again due to Zenodo not specifying the subjectScheme, while Figshare does not even contain the InChIKey in its metadata record.

Search 4a: https://commons.datacite.org/?query=subjects.subjectScheme:inchikey+AND+subjects.schemeUri:*inchi-trust*+AND+subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N constrains the inchikey further by specifying the authority for the scheme definition as the InChI Trust. 

Search 5: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9* is the analogue of search 3, but on the InChI string rather than the InChIKey, and with the same result as before (5 works). Here, the string is deliberately truncated so that it matches on only the molecular formula of the molecule.

Search 5a: https://commons.datacite.org/?query=subjects.subjectScheme:inchi+AND+subjects.subject:InChI=1S/C25H39NO9* is the analogue of search 4, with the subjectScheme changed to inchi and the subject again truncated to only the molecular formula component of an InChI (3 works).
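Truncating an InChI to its molecular formula amounts to keeping the version prefix and the first layer, since layers are separated by "/". A minimal sketch follows; the function name truncate_inchi is hypothetical, and the full InChI is the one quoted in search 5c of this post with the search escapes removed.

```python
# Full InChI for the molecule, as quoted in search 5c (escapes removed).
FULL_INCHI = (
    "InChI=1S/C25H39NO9/"
    "c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)"
    "25(26,30)19(34-5)18(24)22(11-27,21(28)35-20)8-7-15(24)32-3/"
    "h12-20,27,29-30H,6-11H2,1-5H3"
)


def truncate_inchi(inchi: str, layers: int) -> str:
    """Keep the 'InChI=1S' prefix plus the first `layers` slash-separated layers."""
    parts = inchi.split("/")
    return "/".join(parts[: layers + 1])


# One layer after the prefix is the molecular formula.
formula_pattern = truncate_inchi(FULL_INCHI, 1) + "*"
print("subjects.subject:" + formula_pattern)
```

Note that a formula-only pattern will also match any isomer with the same formula that happens to be registered, so it is a weaker constraint than the InChIKey.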

Search 5b: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14\(31-2\)10-23\(29,16\(13\)17\(12\)33-4\)25\(26,30* truncates much less of the InChI string, extending it to the molecular connection table. Notice how characters such as ( or ) have been escaped with a \ prefix. Such characters are used for grouping in the search query and so must be escaped to be included in the query.
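The escaping of grouping characters can be done mechanically rather than by hand. In this sketch the function name escape_grouping is hypothetical, and only ( and ) are handled because those are the only grouping characters discussed here.

```python
def escape_grouping(term: str) -> str:
    """Backslash-escape parentheses so they are treated as literal characters."""
    return term.replace("(", "\\(").replace(")", "\\)")


# The partial InChI used in search 5b, written without escapes.
RAW = (
    "InChI=1S/C25H39NO9/"
    "c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)25(26,30"
)

# Append the wildcard after escaping, so the '*' itself stays unescaped.
query = "subjects.subject:" + escape_grouping(RAW) + "*"
print(query)
```

The printed query matches the one in the search 5b URL character for character.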

Search 5c: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14\(31-2\)10-23\(29,16\(13\)17\(12\)33-4\)25\(26,30\)19\(34-5\)18\(24\)22\(11-27,21\(28\)35-20\)8-7-15\(24\)32-3\/h12-20,27,29-30H,6-11H2,1-5H3* extends the search to the full InChI string. For a string of this length (and InChI strings can get very long!) an unidentified error can occur, suggesting that the full InChI string is best not used for such searches.

Search 6: 

From these experiments, we learn that the quality and completeness/richness of the metadata record is vital to ensure no false positives or negatives are returned by the search. Ensuring such metadata richness is something that a repository should do, and it is interesting that two of the best known repositories both currently have failings in this regard. I might try one or two other popular repositories to see how they behave and will report back if I find anything interesting.


Thus https://commons.datacite.org/doi.org?query=subjects.subjectScheme:*inchikey* reveals all entries that specify an InChIKey in the subject metadata (185,414 works), but https://commons.datacite.org/doi.org?query=subjects.subjectScheme:*inchikey*+AND+subjects.schemeUri:*inchi-trust* reveals that only 1,748 of these further specify the InChI Trust as the authority. Two more repositories, Mendeley Data and Harvard Dataverse, have been populated with the same data. See here.


This post has DOI: 10.14469/hpc/9162
