A comparison of descriptive metadata across different data repositories.

The number of repositories which accept research data across a wide spectrum of disciplines is on the up. Here I report the results of conducting an experiment in which chemical modelling data was deposited in six such repositories and comparing the richness of the metadata describing the essential properties of the six depositions.

The repositories are as follows:

  1. Figshare as a repository dates from 2012. The computational chemistry dataset used was manually uploaded. Most of the metadata was entered manually by copy/paste operations and included three keywords which comprised the InChI key for the molecule, the corresponding InChI string and the calculated Gibbs Energy obtained from the computed vibrational frequencies.
  2. Zenodo started in 2013 and has been updated several times since then. The same data and metadata were used as for Figshare, including the the same keywords, but with the difference that the upload was not manual but automated using the Zenodo API as implemented in the new computational portal described in the previous post (DOI: 10.14469/hpc/9010). Publication here was a simple button click and so is a much shorter process than that for Figshare.
  3. The original 2006 version of the Imperial College data repository was based on DSpace, and updated to version 2 in 2016 with entirely new code. It too is populated by publication from the same portal as used for Zenodo.
  4. Mendeley data, launched in 2015 and now part of the Digital Commons repository family.
  5. Harvard Dataverse, which started around 2006.
  6. ScienceDB, a product of the Network Information Center, Chinese Academy of Sciences. The deposition requires peer review, a process taking ~2-3 weeks.
  7. DataDryad. This repository charges $120 per deposition and was not evaluated here (Waivers are granted for submissions originating from researchers based in countries classified by the World Bank as low-income or lower-middle-income economies)

Each deposition results in the generation of a DOI, and these, together with the link that allows access to the associated metadata can be seen in the table below.

Repository Dataset DOI Dataset
Figshare 10.6084/m9.figshare.16685497 XML
Zenodo 10.5281/zenodo.5511966 XML
Imperial College 10.14469/hpc/9031 XML
Harvard Dataverse 10.7910/DVN/4BWOYK XML
Mendeley Data 10.17632/dgtvds3xn5.1 XML
ScienceDB 10.11922/sciencedb.01522 XML

I would note that manual deposition can be rather dependent on how fastidious the depositor is and how they interpret the descriptive keywords that Figshare and Zenodo accept. Automated deposition is a more controlled process, in which the required keywords are a property programmed into the submitting portal tool. Such a process also allows metadata to describe relationships between different datasets, such as a dataset collection, and is inherited from project descriptor on the portal. Additionally, the automated process can then be augmented by manual editing of the metadata record, as for example, the addition of the DOI for this descriptive post which can be added to the metadata records retrospectively. In the case of e.g. Zenodo, retrospective changes to the metadata record require a new DOI to be generated to reflect the changes. 

You can inspect the results of these three depositions yourself by downloading the respective metadata records and viewing the downloaded file using a simple text or XML editor. 

  1. All the first three repositories contain the ORCID of the depositor, as e.g. from Figshare:
    <creatorName>Rzepa, Henry S.</creatorName> 
    <givenName>Henry S.</givenName> 
    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org">

    The widespread addition of the unique ORCID researcher identifier is very welcome.

  2. The more interesting component is keyword metadata, populated manually in Figshare and using the automated API in the other two repositories.
    1. Below is the Figshare metadata entry, which displays the assigned categories (from a controlled list) in the <subject> container:
          <subject>Computational Chemistry</subject>
          <subject>Organic Chemistry</subject>
          <subject subjectScheme="Fields of Science and Technology (FOS)" schemeURI="http://www.oecd.org/science/inno/38235147.pdf">FOS: Chemical sciences</subject>

      The context of these keywords is clearly defined by the value of the subjectScheme (chemical sciences) but this term is very broad and does not relate very specifically to the deposited data. The more chemically specific keywords themselves are only displayed on the landing page for the entry as shown below and are not expressed in any metadata container, which means that they are not indexed and hence searchable using the DataCite metadata store.

    2. Zenodo interpret this differently, with the keywords now included in the <subject> container.

      However, you might be wondering what the keyword -1705.490787 is all about. Put simply, in this form of expression it has absolutely no context. I previously explained why it might be useful if context is added, it being a persistent identifier for (some) quantum chemical calculations in the form of a computed total energy corrected thermally into a Gibbs energy. The persistence in this case is acquired not by registration with an agency but generation by an algorithm. That algorithm in turn would require additional metadata for its specification, but that is something I will not address in this post. At any rate, because it is part of the metadata record, it is search-enabled in the Zenodo version.

    3. Imperial follows the Zenodo approach, with further addition of context:
          <subject subjectScheme="Gibbs_Energy" schemeURI="https://doi.org/10.1351/goldbook.G02629" valueURI="http://gaussian.com/thermo/">-1705.490787</subject>
          <subject subjectScheme="inchi" schemeURI="http://www.inchi-trust.org/">InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)25(26,30)19(34-5)18(24)22(11-27,21(28)35-20)8-7-15(24)32-3/h12-20,27,29-30H,6-11H2,1-5H3/t12-,13-,14+,15+,16-,17+,18-,19+,20+,22+,23-,24+,25+/m1/s1</subject>
          <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">VELNVPXNOKVVTC-VJKZSTDTSA-N</subject>

      The context is added by addition of the attributes subjectScheme, schemeURI and valueURI. The top level context is the definition provided by the IUPAC Gold Book, and the actual implementation of the algorithm is described on the Gaussian site (although the algorithm there is not explicit in a machine implementable sense). These additions allow an indexed search not only of the numerical value (as a simple string and not as a floating point number) but which can be constrained by specifying the value of e.g. the subjectScheme so that any other random number specified as a keyword which does not have this attribute is excluded. This also allows a search where the floating point number is replaced by wild-cards (*), which would then retrieve ANY reported Gibbs energy, which could in turn be constrained by say the nature of the molecule as expressed using  InChI. 

  3. The final aspect of the metadata analysed here is the relatedIdentifier record. This is increasingly recognised as a crucial component for the construction of so-called PID graphs, which are generated to reveal connections between entities in the research landscape such as data, people, organisations, funders, publications and any other object that is assigned a registered PID (such as perhaps in the future connecting data to its origins from a large instrument). So here are these records for the three repositories:
    1. Although the landing page for the Figshare record has three such entries, including pointers to the other two depositions being discussed here, they are not propagated to the metadata record and so cannot participate in any generated PID graph.
    2. Zenodo has the following record
          <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.5511965</relatedIdentifier>

      which relates to an earlier version of the metadata for this entry.

    3. The Imperial record is:
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata">https://data.hpc.imperial.ac.uk/resolve/?ore=9031</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=1</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=2</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=3</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=4</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.5281/zenodo.5511966</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.6084/m9.figshare.16685497</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">10.14469/hpc/9158</relatedIdentifier>

      where a large number of related PIDs would result in a rich PID graph. These entries include relationType=”HasMetadata” which is a pointer to additional metadata expressed using a different schema (ORE) and which provides a machine-actionable manifest for the files present, specifying the Media Types of each file and a machine method of accessing them. relationType=”HasPart” provides an access URL for each specific item in the fileset. relationType=”References”  is the analogue of the Figshare entries above, citing the other two repositories we are discussing here and finally relationType=”IsPartOf” indicates the deposition is part of a larger collection (in this case the collection generated for this blog) and which could also correspond to e.g. a project comprising multiple researchers at multiple institutions, or say a PhD dissertation containing multiple chapters. The extensive nature of this list of identifiers means that the PID graph would reveal many connections.

I have only covered in detail only three repositories here; more could be added to the list and analyzed for their metadata records. The bottom line is that generally the more metadata that is added, the richer the resulting services and analyses based on PIDs can become. It can only be hoped that this aspect of the operation of repositories continues to improve over time and eventually most will broadcast very rich metadata, including at the very specific subject level. This should enrich the research landscapes, especially at the finely grained subject level.

In the next post, I will analyse the results of searches enabled by this metadata.

Figshare also has an available API, which has not been implemented in the current version of this portal. Policies regarding editing of metadata vary. Some repositories editing updates to the record held by DataCite against the existing DOI. Others require the generation of a new DOI for each new version of the metadata, no matter how small a change (e.g. spelling mistakes in the title etc). An unsolved problem in DataCite metadata is datatypes and units. This entry is a floating point data type, with units of Hartree. How this information can be added is still being discussed. The registration authority is obtained using the syntax https://doi.org/ra/10.11922/sciencedb.01522 which reveals it is the China ISTIC (http://dx.chinadoi.cn/10.11922/sciencedb.01522). I am trying to find out how to use this registration agency to recover metadata, using a service equivalent to eg https://api.crossref.org/ or https://api.datacite.org/

This post has DOI: 10.14469/hpc/9159


Leave a Reply