One molecule, one identifier: Viewing molecular files from a digital repository using metadata standards.

In the beginning (taken here as prior to ~1980) libraries held five-year printed consolidated indices of molecules, organised by formula or name (Chemical Abstracts). Each five-year set could occupy about 2 m of shelf space, and alongside it sat an equivalent set of printed volumes from the Beilstein collection. Those of us who needed to track down information about molecules prior to ~1980 spent many an afternoon (or indeed a whole day) in the libraries thumbing through these weighty volumes. Fast forward to the present, when (closed) commercial databases such as SciFinder, Reaxys and CCDC offer information online for around 100 million molecules (CAS indicates it has 89,506,154 today, for example). These have been joined by many open databases (e.g. PubChem). All these sources of molecular information have their own way of accessing individual entries, and the wonderful program Jmol (nowadays JSmol) has several of these custom interfaces programmed in. Here I describe some work we have recently done[cite]10.1021/ci500302p[/cite] on how one might generalise access to an individual molecule held in what is now called a digital data repository.

Such repositories are gradually becoming more common. Unlike most (all?) of the bespoke molecular repositories noted above, metadata (XML) resourcemap standards have been developed[cite]http://doi.org/10320/loc[/cite] for data repositories to enable rich and open searches and to help in the discoverability of individual entries (e.g. OAI-ORE). Each dataset is characterised by a DOI (digital object identifier), just like individual articles found in a conventional journal. However, there is an issue in quoting just a conventional DOI to describe a dataset. The DOI points to what is called the article landing page in the journal, a page which by and large is meant to be navigated by a human. To get a flavour for how this works (or more accurately does not work) for data, visit this DOI[cite]10.5517/CC11H55W[/cite] for an entry in the CCDC crystal database noted above (and about which I have previously blogged). In essence, a human is needed to complete the requested information before the data can be retrieved. Data, I contend here, should not need a landing page. It can benefit from being passed straight on to, e.g., a visualising program such as JSmol. So a mechanism is needed to encapsulate any bespoke (and potentially changeable) access path to the data by expressing it instead in standard metadata form.

In our first solution to this issue, and the one illustrated here, we used a standard known as 10320/loc[cite]http://doi.org/10320/loc[/cite]. A datafile need only be specified by its DOI (or more generically, its handle) to be recovered from the data repository; no landing page need be involved (and no human need ponder what to do next with the data).

  1. First, let me reference a molecule (as it happens the one described in the preceding post), using the normal invocation[cite]10042/31018[/cite]. This will take you to a conventional landing page.
  2. The next example is the same dataset, but this time with the landing page replaced by a JavaScript/JSmol wrapper. This is achieved using a utility which is itself packaged up and placed on a repository (shortdoi: vjj)[cite]10.6084/m9.figshare.1164282[/cite], and which is embedded here for you to try out. If you want the technical detail, read about it here.[cite]10.1021/ci500302p[/cite]
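
To give a flavour of what such a metadata record makes possible, here is a minimal Python sketch. It is not the utility cited above, only an illustration. It assumes that a 10320/loc record is a small XML document listing `<location>` elements, each with an `href` and some descriptive attributes, and that a resolver will accept a `locatt=key:value` query to select one of them. The sample record, the repository URLs and the `filename` and `mimetype` attribute values are all hypothetical; only the handle 10042/31018 comes from the example above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode

# Hypothetical 10320/loc record, invented for illustration only.
# The hrefs and the filename/mimetype attributes are assumptions.
SAMPLE_LOC = """
<locations chooseby="locatt,country,weighted">
  <location id="0" href="https://repository.example.org/landing/1234" weight="1"/>
  <location id="1" href="https://repository.example.org/files/1234/molecule.cml"
            weight="0" filename="molecule.cml" mimetype="chemical/x-cml"/>
  <location id="2" href="https://repository.example.org/files/1234/molecule.log"
            weight="0" filename="molecule.log" mimetype="chemical/x-gaussian-log"/>
</locations>
"""


def parse_locations(loc_xml):
    """Return one dict of attributes per <location> element in the record."""
    root = ET.fromstring(loc_xml.strip())
    return [dict(loc.attrib) for loc in root.iter("location")]


def choose_location(locations, key, value):
    """Return the href of the first location whose attribute matches, else None."""
    for loc in locations:
        if loc.get(key) == value:
            return loc.get("href")
    return None


def resolver_url(handle, key, value, resolver="https://doi.org"):
    """Build a URL asking the resolver for one location by attribute.

    Assumes the resolver honours a `locatt=key:value` query string.
    """
    query = urlencode({"locatt": f"{key}:{value}"}, safe=":")
    return f"{resolver}/{quote(handle, safe='/')}?{query}"


if __name__ == "__main__":
    locations = parse_locations(SAMPLE_LOC)
    # Pick the data file directly, bypassing the landing page entry (id="0").
    print(choose_location(locations, "mimetype", "chemical/x-cml"))
    print(resolver_url("10042/31018", "filename", "molecule.cml"))
```

The point of the sketch is that the program, not a human, decides which file to fetch: it asks for a location by a descriptive attribute and receives a URL it can hand straight to a viewer such as JSmol.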

There is more to come. But you will have to wait for part 2!
