The number of repositories which accept research data across a wide spectrum of disciplines is on the up. Here I report the results of conducting an experiment in which chemical modelling data was deposited in six such repositories and comparing the richness of the metadata describing the essential properties of the six depositions.
The repositories are as follows:
Each deposition results in the generation of a DOI, and these, together with the link that allows access to the associated metadata can be seen in the table below.
Repository | Dataset DOI | Dataset metadata |
---|---|---|
Figshare | 10.6084/m9.figshare.16685497 | XML JSON |
Zenodo | 10.5281/zenodo.5511966 | XML JSON |
Imperial College | 10.14469/hpc/9031 | XML JSON |
Harvard Dataverse | 10.7910/DVN/4BWOYK | XML XML Codebook |
Mendeley Data | 10.17632/dgtvds3xn5.1 | XML JSON |
ScienceDB | 10.11922/sciencedb.01522 | XML♠ JSON♠ |
I would note that manual deposition can be rather dependent on how fastidious the depositor is and how they interpret the descriptive keywords that Figshare and Zenodo accept. Automated deposition is a more controlled process, in which the required keywords are a property programmed into the submitting portal tool. Such a process also allows metadata to describe relationships between different datasets, such as a dataset collection, and is inherited from project descriptor on the portal. Additionally, the automated process can then be augmented by manual editing of the metadata record, as for example, the addition of the DOI for this descriptive post which can be added to the metadata records retrospectively. In the case of e.g. Zenodo, retrospective changes to the metadata record require a new DOI to be generated to reflect the changes.†
You can inspect the results of these three depositions yourself by downloading the respective metadata records and viewing the downloaded file using a simple text or XML editor.
<creator> <creatorName>Rzepa, Henry S.</creatorName> <givenName>Henry S.</givenName> <familyName>Rzepa</familyName> <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org"> https://orcid.org/0000-0002-8635-8390 </nameIdentifier> </creator>
The widespread addition of the unique ORCID researcher identifier is very welcome.
<subjects> <subject>Computational Chemistry</subject> <subject>Organic Chemistry</subject> <subject subjectScheme="Fields of Science and Technology (FOS)" schemeURI="http://www.oecd.org/science/inno/38235147.pdf">FOS: Chemical sciences</subject> </subjects>
The context of these keywords is clearly defined by the value of the subjectScheme (chemical sciences) but this term is very broad and does not relate very specifically to the deposited data. The more chemically specific keywords themselves are only displayed on the landing page for the entry as shown below and are not expressed in any metadata container, which means that they are not indexed and hence searchable using the DataCite metadata store.
<subjects> <subject>-1705.490787</subject> <subject>InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)25(26,30)19(34-5)18(24)22(11-27,21(28)35-20)8-7-15(24)32-3/h12-20,27,29-30H,6-11H2,1-5H3/t12-,13-,14+,15+,16-,17+,18-,19+,20+,22+,23-,24+,25+/m1/s1</subject> <subject>VELNVPXNOKVVTC-VJKZSTDTSA-N</subject> </subjects>
However, you might be wondering what the keyword -1705.490787 is all about. Put simply, in this form of expression it has absolutely no context. I previously explained why it might be useful if context is added, it being a persistent identifier for (some) quantum chemical calculations in the form of a computed total energy corrected thermally into a Gibbs energy.♥ The persistence in this case is acquired not by registration with an agency but generation by an algorithm. That algorithm in turn would require additional metadata for its specification, but that is something I will not address in this post. At any rate, because it is part of the metadata record, it is search-enabled in the Zenodo version.
<subjects> <subject subjectScheme="Gibbs_Energy" schemeURI="https://doi.org/10.1351/goldbook.G02629" valueURI="http://gaussian.com/thermo/">-1705.490787</subject> <subject subjectScheme="inchi" schemeURI="http://www.inchi-trust.org/">InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)25(26,30)19(34-5)18(24)22(11-27,21(28)35-20)8-7-15(24)32-3/h12-20,27,29-30H,6-11H2,1-5H3/t12-,13-,14+,15+,16-,17+,18-,19+,20+,22+,23-,24+,25+/m1/s1</subject> <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">VELNVPXNOKVVTC-VJKZSTDTSA-N</subject> </subjects>
The context is added by addition of the attributes subjectScheme, schemeURI and valueURI. The top level context is the definition provided by the IUPAC Gold Book, and the actual implementation of the algorithm is described on the Gaussian site (although the algorithm there is not explicit in a machine implementable sense). These additions allow an indexed search not only of the numerical value (as a simple string and not as a floating point number) but which can be constrained by specifying the value of e.g. the subjectScheme so that any other random number specified as a keyword which does not have this attribute is excluded. This also allows a search where the floating point number is replaced by wild-cards (*), which would then retrieve ANY reported Gibbs energy, which could in turn be constrained by say the nature of the molecule as expressed using InChI.
<relatedIdentifiers> <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.5511965</relatedIdentifier> </relatedIdentifiers>
which relates to an earlier version of the metadata for this entry.†
<relatedIdentifiers> <relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata">https://data.hpc.imperial.ac.uk/resolve/?ore=9031</relatedIdentifier> <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=1</relatedIdentifier> <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=2</relatedIdentifier> <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=3</relatedIdentifier> <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=4</relatedIdentifier> <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.5281/zenodo.5511966</relatedIdentifier> <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.6084/m9.figshare.16685497</relatedIdentifier> <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">10.14469/hpc/9158</relatedIdentifier> </relatedIdentifiers>
where a large number of related PIDs would result in a rich PID graph. These entries include relationType=”HasMetadata” which is a pointer to additional metadata expressed using a different schema (ORE) and which provides a machine-actionable manifest for the files present, specifying the Media Types of each file and a machine method of accessing them. relationType=”HasPart” provides an access URL for each specific item in the fileset. relationType=”References” is the analogue of the Figshare entries above, citing the other two repositories we are discussing here and finally relationType=”IsPartOf” indicates the deposition is part of a larger collection (in this case the collection generated for this blog) and which could also correspond to e.g. a project comprising multiple researchers at multiple institutions, or say a PhD dissertation containing multiple chapters. The extensive nature of this list of identifiers means that the PID graph would reveal many connections.
I have only covered in detail only three repositories here; more could be added to the list and analyzed for their metadata records. The bottom line is that generally the more metadata that is added, the richer the resulting services and analyses based on PIDs can become. It can only be hoped that this aspect of the operation of repositories continues to improve over time and eventually most will broadcast very rich metadata, including at the very specific subject level. This should enrich the research landscapes, especially at the finely grained subject level.
In the next post, I will analyse the results of searches enabled by this metadata.
‡Figshare also has an available API, which has not been implemented in the current version of this portal. †Policies regarding editing of metadata vary. Some repositories editing updates to the record held by DataCite against the existing DOI. Others require the generation of a new DOI for each new version of the metadata, no matter how small a change (e.g. spelling mistakes in the title etc). ♥An unsolved problem in DataCite metadata is datatypes and units. This entry is a floating point data type, with units of Hartree. How this information can be added is still being discussed. ♠The registration authority is obtained using the syntax https://doi.org/ra/10.11922/sciencedb.01522 which reveals it is the China ISTIC (http://dx.chinadoi.cn/10.11922/sciencedb.01522). I am trying to find out how to use this registration agency to recover metadata, using a service equivalent to eg https://api.crossref.org/ or https://api.datacite.org/
This post has DOI: 10.14469/hpc/9159
In an earlier post, I discussed a phenomenon known as the "anomeric effect" exhibited by…
In the mid to late 1990s as the Web developed, it was becoming more obvious…
I have written a few times about the so-called "anomeric effect", which relates to stereoelectronic…
The recent release of the DataCite Data Citation corpus, which has the stated aim of…
Following on from my template exploration of the Wilkinson hydrogenation catalyst, I now repeat this…
In the late 1980s, as I recollected here the equipment needed for real time molecular…