One molecule, one identifier: Viewing molecular files from a digital repository using metadata standards.

In the beginning (taken here as prior to ~1980) libraries held five-year printed consolidated indices of molecules, organised by formula or name (Chemical abstracts). This could occupy about 2m of shelf space for each five years. And an equivalent set of printed volumes from the Beilstein collection. Those of us who needed to track down information about molecules prior to ~1980 spent many an afternoon (or indeed a whole day) in the libraries thumbing through these weighty volumes. Fast forward to the present, when (closed) commercial databases such as SciFinder, Reaxys and CCDC offer information online for around 100 million molecules (CAS indicates it has 89,506,154 today for example). These have been joined by many open databases (e.g. PubChem). All these sources of molecular information have their own way of accessing individual entries, and the wonderful program Jmol (nowadays JSmol) has several of these custom interfaces programmed in. Here I describe some work we have recently done[1] on how one might generalise access to an individual molecule held in what is now called a digital data repository.

Such repositories are gradually becoming more common. Unlike most (all?) of the bespoke molecular repositories noted above, metadata (XML) resourcemap standards have been developed[2] for data repositories to enable rich and open searches and to help in the discoverability of individual entries (e.g. OAI-ORE). Each dataset is characterised by a DOI (digital object identifier), just like individual articles found in a conventional journal. However, there is an issue in quoting just a conventional DOI to describe a dataset. The DOI points to what is called the article landing page in the journal. A landing page which by and large is meant to be navigated by a human. To get a flavour for how this works (or more accurately does not work) for data, visit this DOI[3] for an entry in the CCDC crystal database noted above (and about which I have previously blogged). In essence, a human is needed to complete the requested information in order to proceed to retrieving the data. Data, I contend here, should not need a landing page. It can benefit from being passed straight on to e.g. a visualising program such as JSmol. So a mechanism is needed to encapsulate any bespoke (and potentially changeable) access path to the data by expressing it instead in standard metadata form.

In our first solution to this issue, and the one illustrated here, we used a standard known as 10320/loc[2]. A datafile need only be specified by its DOI (or more generically, its handle) to be recovered from the data repository; no landing page need be involved (and no human need ponder what next to do with the data).

  1. First, let me reference a molecule (as it happens the one described in the preceding post), using the normal invocation[4]. This will take you to a conventional landing page.
  2. The next example is the same dataset, but this time with the landing page replaced by a Javascript/JSmol wrapping. This is achieved using a utility which is itself packaged up and placed on a repository (shortdoi: vjj)[5], and which is embedded here for you to try out. If you want the technical detail, read about it here.[1]

There is more to come. But you will have to wait for part 2!

References

  1. M.J. Harvey, N.J. Mason, and H.S. Rzepa, "Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks", Journal of Chemical Information and Modeling, vol. 54, pp. 2627-2635, 2014. http://dx.doi.org/10.1021/ci500302p
  2. "DOI Name 10320/loc Values"http://doi.org/10320/loc
  3. Jana, Anukul., Omlor, Isabell., Huch, Volker., Rzepa, Henry S.., and Scheschkewitz, David., "CCDC 967887: Experimental Crystal Structure Determination", 2014. http://dx.doi.org/10.5517/CC11H55W
  4. Henry S. Rzepa., Nicholas Mason., and M J Harvey., "Retrieval and display of Gaussian log files from a digital repository", 2014. http://dx.doi.org/10.6084/m9.figshare.1164282
Henry Rzepa

Henry Rzepa is Emeritus Professor of Computational Chemistry at Imperial College London.

Recent Posts

Exploring Methanetriol – “the Formation of an Impossible Molecule”

What constitutes an "impossible molecule"? Well, here are two, the first being the topic of…

12 hours ago

Detecting anomeric effects in tetrahedral boron bearing four oxygen substituents.

In an earlier post, I discussed a phenomenon known as the "anomeric effect" exhibited by…

2 weeks ago

Internet Archeology: reviving a 2001 article published in the Internet Journal of Chemistry.

In the mid to late 1990s as the Web developed, it was becoming more obvious…

2 months ago

Detecting anomeric effects in tetrahedral carbon bearing four oxygen substituents.

I have written a few times about the so-called "anomeric effect", which relates to stereoelectronic…

2 months ago

Data Citation – a snapshot of the chemical landscape.

The recent release of the DataCite Data Citation corpus, which has the stated aim of…

3 months ago

Mechanistic templates computed for the Grubbs alkene-metathesis reaction.

Following on from my template exploration of the Wilkinson hydrogenation catalyst, I now repeat this…

3 months ago