I have had some interesting discussions recently regarding metadata. What emerges is that it can be quite a broadly defined concept and it is clear that a variety of answers might be obtained when asking the simple question “what is it useful for?” Here I set out some of my answers to that question.
Some context needs to be applied before answering such a question (context is perhaps a synonym for metadata!).
These are broad-grained provenance, if you like.
We are now moving into fine-grained metadata, and perhaps even crossing the boundary into data itself: the parameters for either software or instruments can be large and complex, and are often so heavily mixed into the data that extricating them can be a challenge.
Before introducing examples based on metadata with the focus on discoverability, I want to distinguish between locally packaged metadata and separated metadata (Qu. 2 above). The examples below relate purely to the latter, which has been created as a separate entity by registration with an agency such as DataCite. Such registration also addresses Qu. 3 above about trust. This external agency adds trust by recording the identity of the person (or a process or workflow initiated by a person) registering the metadata together with the registration date (the Datestamp) and also monitors any changes to the metadata (which is allowed) by keeping its version history. Interestingly, there seems to be no mechanism to record any processes or workflows used to create metadata so as to learn how the metadata itself was assembled. Nor have I seen much discussion of this aspect; one for the future I fancy.
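To make the registration record more concrete, here is a minimal sketch of the kind of metadata an agency such as DataCite holds for a registered dataset. The field names follow the DataCite Metadata Schema, but the creator, title, publisher and dates shown are purely illustrative, not the actual record behind the DOI mentioned in this post.

```python
# Illustrative DataCite-style metadata record (values are hypothetical).
record = {
    "identifier": "10.14469/hpc/5920",          # the registered DOI
    "creators": [{"name": "Example, Author"}],   # who registered it (hypothetical)
    "titles": [{"title": "Search queries for FAIR data discovery"}],
    "publisher": "Example Repository",
    "publicationYear": 2019,
    "dates": [{"date": "2019-07-16", "dateType": "Issued"}],  # the Datestamp
    "version": "1",  # DataCite keeps a version history as the metadata changes
}

def registered_datestamp(rec):
    """Return the 'Issued' date recorded at registration, if any."""
    for d in rec.get("dates", []):
        if d.get("dateType") == "Issued":
            return d["date"]
    return None
```

Note that nothing in such a record describes *how* the metadata itself was assembled, which is exactly the gap identified above.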
I now introduce some examples of discoverability. The descriptions are quite short and are meant to be read in conjunction with a “reverse-engineering” of the (somewhat) human-readable search queries. These queries are also deposited as “data” at DOI: 10.14469/hpc/5920.
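As a sketch of how such a search can be assembled programmatically, the snippet below builds a query URL against the public DataCite REST API (`https://api.datacite.org/dois`, with its `query` parameter). The endpoint is real; the particular query string shown is a hypothetical example, not one of the deposited queries.

```python
from urllib.parse import urlencode

BASE = "https://api.datacite.org/dois"

def build_search(query, page_size=25):
    """Return a DataCite DOI-search URL for a metadata query string."""
    params = {"query": query, "page[size]": page_size}
    return BASE + "?" + urlencode(params)

# Hypothetical example: DOIs whose creator matches "Smith", published in 2019.
url = build_search('creators.name:"Smith" AND publicationYear:2019')
```

Fetching that URL (e.g. with any HTTP client) returns a JSON list of matching DOI records, which is essentially what the deposited queries do.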
The examples above reveal a not entirely human-friendly syntax; with each of them, some effort at “de-bugging” was needed to make them work. I gather from the PIDForum that a friendlier GUI for this is on their radar. As I develop or discover more examples of such searches, I will add them to the list at DOI: 10.14469/hpc/5920. Meanwhile, if you want to use any of the above as a template for your own searches, do please explore.
Derek Lowe's blog yesterday:
https://blogs.sciencemag.org/pipeline/archives/2019/07/15/machine-mining-the-literature
This highlights the potential of machine processing of properly curated information in natural language (journal abstracts in this case) to provide useful inputs to research. If metadata could routinely stitch the text and the underlying data together, computers would suddenly become much more useful.
ContentMine has been doing this for a little while. Natural-language extraction (in chemistry) is at best around 95% accurate, and a fair bit more has to be done to render the results more reliable.
I agree that good metadata combined with natural (trained) language searching has lots of potential. Interestingly, whereas the introduction of Google revolutionised how humans search for information, newer generations of search engine such as Elasticsearch are leading the way for embedding into AI engines. I note that the metadata for FAIR data is indexed by DataCite using Elasticsearch. So we may well expect some revolutionary tools based on natural language in combination with Elastic-indexed metadata to emerge in the next few years.
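For a flavour of what sits behind such an index, here is a sketch of an Elasticsearch query-DSL request body. The `bool`/`must`/`match`/`term` structure is standard Elasticsearch DSL; the field names and values are hypothetical, chosen only to echo the kind of metadata fields discussed above.

```python
import json

# Hypothetical Elasticsearch query: title matches a phrase AND year is exact.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"titles.title": "anomeric effect"}},
                {"term": {"publicationYear": 2019}},
            ]
        }
    }
}

# The request body sent to an Elasticsearch _search endpoint is just JSON.
body = json.dumps(query)
```

An AI engine that can translate a natural-language question into a structure like this is precisely the combination speculated on above.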