Goldilocks Data.

Last August, I wrote about data galore, the archival of data for 133,885 (134 kilo) molecules into a repository, together with an associated data descriptor[cite]10.1038/sdata.2014.22[/cite] published in the new journal Scientific Data. Since six months is a long time in the rapidly evolving field of RDM, or research data management, I offer an update in the form of some new observations.

Firstly, 131 kilo molecules are now offered in a new and different form, and it is worth comparing how the two sets of otherwise identical data are presented.

  1. The original archive had a single assigned DOI[cite]10.6084/m9.figshare.978904[/cite] from where you could download a ZIP file to be unpacked and navigated on your own computer. The exposed metadata for the deposition (by which I mean in this case, metadata registered with DataCite, the registration authority used by Figshare) was limited to general information about the 133,885 molecules such as the authorship and license. The granularity is coarse, not extending to descriptions of individual molecules.
  2. The new version forgoes the ZIP archive, replacing it with a proper database (based on MongoDB) containing information about 130,832 molecules. This allows one to search the data at the individual molecule level (formula, InChI descriptor, mass, etc.) using the tools provided. To the end-user, this is much more useful; the data is both discoverable and re-usable.

There is no overlap between these two presentations of the data. There also appears to be no API (application programming interface) which might allow one to write code to construct one’s own searches. The apparent absence of an API also means that really only a human navigating the set menus can discover and re-use that data; the data might not be mineable by a machine, for example. The absence of an API is not that unusual; only some of the best-known molecular databases offer one, the RCSB Protein Data Bank being a good example. More significantly, each instance of such a molecule-based database is likely to have its own way of accessing the data, and even if a documented API were available, one would still have to write specific code for each such resource.

So the first bowl contains what I suggest is cold porridge and the second is perhaps equivalent to a table d’hôte menu. Does Goldilocks have a third option? I would argue yes, she could have:

  1. We recently published data for 158 kilo molecules[cite]10.14469/ch/2[/cite] for which each molecule carries its own metadata. That metadata can be queried using any search engine that supports the basic metadata standards:
    is an example. Or, armed with the metadata schema, one could also write one’s own search engine and, in theory at least, that code should serve to query ANY repository that supports these standards.
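
To make the idea of "writing one’s own search" a little more concrete, here is a minimal sketch in Python of how a per-molecule metadata query might be assembled. Everything specific in it is an assumption made for illustration: the endpoint URL, the `query` parameter name, the `field:"value"` syntax and the field names are hypothetical, and are not the documented interface of any particular repository or registration agency. The code only builds the search URL; it sends nothing over the network.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, standing in for any search service that
# indexes standard per-item metadata. Not a real service.
SEARCH_ENDPOINT = "https://search.example.org/api"


def build_query(fields):
    """Join field/value pairs into a string such as 'a:"x" AND b:"y"'.

    The field:"value" syntax is an assumed, Lucene-like convention.
    """
    parts = []
    for name, value in fields.items():
        escaped = str(value).replace('"', '\\"')  # keep embedded quotes literal
        parts.append(f'{name}:"{escaped}"')
    return " AND ".join(parts)


def build_search_url(fields, endpoint=SEARCH_ENDPOINT):
    """Return the full search URL for the given metadata fields."""
    return endpoint + "?" + urlencode({"query": build_query(fields)})


if __name__ == "__main__":
    # Hypothetical field names and values, for illustration only.
    print(build_search_url({"publisher": "Example Repository",
                            "subject": "example-identifier"}))
```

The point of the sketch is that only `SEARCH_ENDPOINT` would need to change to aim the same query at a different repository, provided both expose the same metadata fields; that is the property a shared metadata schema is meant to deliver.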

You could argue that all that has happened is one has simply replaced a specific database API (if it exists) with a specific metadata schema. But these metadata schemas are controlled standards, the components of which should be self-describing (and one can see the schema components by invoking the link above).

As the archival of data (RDM) becomes increasingly important, communities will have to start making decisions about which flavour of data-porridge to offer Goldilocks. For molecular data at least, I suggest the third option is highly desirable and perhaps likely to be the most persistent. Parochial databases very much depend on a specialised team of people to maintain them in perpetuity, which I gather now means 20 years. At the very least, we should start to have a debate about how the future will evolve. Let us not leave this debate merely in the hands of a small number of large organisations that are likely to make decisions based on their own business models. After all, it starts off at least as our data, not theirs! Arguably, we as authors have now largely lost control over how our stories (journal articles) are managed; let us not cede the same for data.
