A (light) introductory tutorial on Research Data Management (in chemistry).

Management of research (data) outputs is a hot topic in the UK at the moment, although the topic has been rumbling for five years or more. Most research-active higher educational establishments have published, or are about to publish, general guidelines, which predominantly take the form of aspirational targets rather than actionable examples or use-cases. Because the concepts remain somewhat abstract, one can encounter questions from researchers such as “how should I go about achieving such RDM (research data management)?” I thought it might be useful to summarise here some key features in the form of an FAQ that can help answer that question. I will concentrate purely on the sub-set of chemistry about which I know most.

I will start by exploring the acronym FAIR data.

  • F is findable. This means that metadata is a key part of the process, since it is this information that allows the research data to be more easily found, not only by other humans but by software engines which specialise in such activity.
  • A is accessible. And easily so. Which means a standard identifier to get to the research data, with no paywalls, account registrations or other obstructions. It should ideally be possible to access data anonymously, without necessarily revealing personal information.
  • I is inter-operable. This is harder to define exactly, but the essence is that it should be possible to re-use the data in a context different from the original, and perhaps even outside the subject domain where it was created. For example, if data was collected using one specific instrument, it should be possible to use it without necessarily having access to either an identical instrument or to the software associated with that instrument.
  • R is reusable. There should be sufficient information about the data and its parameters to repeat its collection, if necessary, independently of the original, or to re-use it to start a new data collection. Reusable also means by software, and not just by a human.

The first two properties are easily achieved, since standard procedures can be used. The last two properties are potentially more difficult, since they require more intervention or thought by both the depositor and the re-user. So I will concentrate really on the first two, since by and large they will satisfy most of the general guidelines issued by funders and universities, but note that we must not in the medium to longer term forget the last two.
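
To make the F and A of FAIR a little more concrete, here is a minimal sketch in Python. It assumes only that a DOI can be resolved through the public https://doi.org/ resolver; the record layout, the function names and the two example records (with placeholder DOIs) are purely illustrative and are not taken from any particular repository.

```python
from urllib.parse import quote


def doi_to_url(doi):
    """Turn a bare DOI (optionally prefixed with 'doi:') into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Keep the '/' separators, percent-encode anything unusual in the suffix.
    return "https://doi.org/" + quote(doi, safe="/")


def find_records(records, term):
    """Return the DOIs of records whose title or keywords mention the term."""
    term = term.lower()
    return [
        record["doi"]
        for record in records
        if term in record["title"].lower()
        or any(term in keyword.lower() for keyword in record["keywords"])
    ]


# Hypothetical metadata records; the DOIs are placeholders, not real datasets.
records = [
    {
        "doi": "10.5072/example-1",
        "title": "NMR spectra of a hypothetical compound",
        "keywords": ["NMR", "spectroscopy"],
    },
    {
        "doi": "10.5072/example-2",
        "title": "Computed geometry of a hypothetical compound",
        "keywords": ["DFT"],
    },
]

print(doi_to_url("doi:10.14469/ch/191378"))  # A: one standard way in
print(find_records(records, "nmr"))          # F: found via its metadata
```

The point of the sketch is that without the title and keywords there is nothing for a search to match against, and without the identifier there is no standard route to the data once it has been found.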

I will now list some typical types of data that I have personal experience of. As the community increasingly participates in such RDM, this list will expand by “crowd-sourcing”; if your type of data is not listed, do not give up! 

  1. Data generated by software without instrumental inputs, good examples of which are the outputs of computational chemistry. I have the most personal experience in this area, having been at it for ten years or more[cite]10.1021/ci7004737[/cite],[cite]10.1021/ci500302p[/cite] and examples are scattered throughout this blog (and in many of our recent research publications).
  2. Software developed as part of the data collection process and which might be required by others to re-use the data. An example of such was described in a previous post, and has been RDMed here.[cite]10.5281/zenodo.19949[/cite]
  3. Data generated by software associated with instrumental outputs. In chemistry this means spectrometers and other instruments, most of which now have computers which handle the data outputs. Specific examples might be crystal structures, NMR, IR, MS and optical (including chiroptical) spectra.
    • Crystal structures are the gold standard in RDM, since they fulfil all the requirements of FAIR and so merit a special mention here. In the last year, the Cambridge structural database (CSD) has implemented a standard access mechanism based on a digital object identifier (DOI).[cite]10.5517/CC11TJ7M[/cite]
    • The end point of many other instrumental outputs is a PDF file. Such files do not easily achieve the IR of FAIR (see my comment above), but we will admit the PDF format as a temporary expedient until the use of semantically richer formats increases (the gold example here being the CIF format for crystal structures). You can see an example of PDF files here as a fileset[cite]10.6084/m9.figshare.777773[/cite] describing 1H and 13C NMR, mass spectrometry, ECD (electronic circular dichroism) and VCD (vibrational circular dichroism). Perhaps a better format for expressing many types of spectra is the Excel spreadsheet, which achieves a reasonable proportion of the IR aspirations of FAIR. Both expressions can be included in the collection.
    • As a postscript to this list, I should mention that instrumental data is often found as:
      • raw (unreduced or unprocessed) data, which can be very large (e.g. Free induction decay time-domain data in NMR).
      • A version which has already been subjected to processing (Fourier transformed frequency-domain data in NMR, i.e. a spectrum). This is probably more suitable for archiving, but it's a fine judgement.
      • As a rough rule of thumb, chemistry data intended for archival should be less than about 1 GB.
  4. Synthetic methodologies that describe the preparation and characterisation of molecules. You can see an example of such data here.[cite]10.1039/SP501[/cite] 
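
The distinction drawn in item 3 between raw time-domain NMR data and the processed spectrum can be illustrated with a toy example. The sketch below is not how any spectrometer software works; it simply assumes a synthetic, exponentially decaying complex signal (standing in for a free induction decay) and applies a naive discrete Fourier transform to it. All names and parameter values are made up for illustration.

```python
import cmath
import math


def synthetic_fid(n_points, freq_bin, decay_points):
    """Toy 'raw' data: a complex sinusoid that decays exponentially in time."""
    return [
        cmath.exp(2j * math.pi * freq_bin * n / n_points) * math.exp(-n / decay_points)
        for n in range(n_points)
    ]


def dft(signal):
    """Naive O(N^2) discrete Fourier transform; adequate for a small example."""
    n_points = len(signal)
    return [
        sum(
            signal[n] * cmath.exp(-2j * math.pi * k * n / n_points)
            for n in range(n_points)
        )
        for k in range(n_points)
    ]


# Raw (time-domain) data: 64 complex points oscillating at frequency bin 5.
fid = synthetic_fid(n_points=64, freq_bin=5, decay_points=20.0)

# Processed (frequency-domain) data: the magnitude spectrum.
spectrum = dft(fid)
magnitudes = [abs(value) for value in spectrum]
peak_bin = magnitudes.index(max(magnitudes))

print("strongest signal at frequency bin", peak_bin)
```

Both lists hold the same number of points here, but in practice the processed form is the one most readers can interpret directly, while the raw form is what allows the processing to be repeated with different parameters.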

Now I come to how the (molecular) data is packaged, and this is best described in terms of its granularity. There are perhaps four classes:

  1. All the data is packaged into a single compressed (ZIP) archive. An example can be found here[cite]10.6084/m9.figshare.978904[/cite] containing coordinates for 134,000 molecules. If your interest is in just one of these molecules, then you could argue that this data does not fully conform to the F of FAIR, since it contains no information (metadata) about individual molecules.
  2. The next packaging is (in chemistry) for a specific molecule (or perhaps reaction). An example is again[cite]10.6084/m9.figshare.777773[/cite], which contains data about a specific molecule, and that molecule is itself defined by the inclusion of e.g. a Chemdraw file. Another example[cite]10.1039/SP501[/cite] relates to reaction information, and also includes spectroscopic data in the form of a JCAMP-DX file, which is semantically preferable to e.g. an Excel spreadsheet or just a PDF file. Most of the examples on this blog are in this category, relating to quantum chemical computations of a specific molecule.[cite]10.14469/ch/191378[/cite] I will concentrate here just on this second type of packaging.
  3. The most finely-grained packaging is at the molecular property level. To illustrate this, go visit e.g. the Wikipedia page for aspirin, where you will find a ChemBox containing property data. In the future, these ChemBox properties will be interactively populated from a data repository known as WikiData. This type of RDM is still developing, and I include it here as a placeholder and to counterbalance the first category above!
  4. This category is a little different from the previous three; it relates to a collection of packages, where the granularity of class 2 above is retained, but boxed up into a project collection.[cite]10.14469/ch/2[/cite]
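
As a rough illustration of the second class of packaging (one package per molecule, carrying its own metadata), here is a short Python sketch. The file names, the metadata keys and the water geometry are all illustrative assumptions; a real repository will define its own deposit format.

```python
import io
import json
import zipfile


def package_molecule(files, metadata):
    """Bundle data files plus a metadata.json into one in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for filename, content in files.items():
            archive.writestr(filename, content)
        # The metadata travels with the data, so the package describes itself.
        archive.writestr("metadata.json", json.dumps(metadata, indent=2))
    return buffer.getvalue()


# Illustrative geometry for water in XYZ format (not a computed result).
water_xyz = (
    "3\n"
    "water (illustrative geometry)\n"
    "O  0.000  0.000  0.117\n"
    "H  0.000  0.757 -0.469\n"
    "H  0.000 -0.757 -0.469\n"
)

package = package_molecule(
    files={"water.xyz": water_xyz},
    metadata={
        "title": "Example package for a single molecule",
        "inchi": "InChI=1S/H2O/h1H2",  # identifies the chemical species
    },
)

print(len(package), "bytes in the archive")
```

Contrast this with the first class, where a single archive holds many thousands of molecules but says nothing about any one of them individually.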

And now to look at the life cycle of some data.

  1. The data starts off as live. This is some sort of holding store which members of the group can access/contribute to. It can be a local sharepoint or a cloud-based resource such as DropBox, but it could still be a simple DVD or USB storage device.
    • For some ten years now we have used a locally built live data store (which is itself archived at Zenodo as software[cite]10.5281/zenodo.19174[/cite]). It tracks a user’s experiments, including initiation and completion dates and times, serves as a simple interface for archival, records published experiments, flags requested data embargoes (see below) and provides a search interface for all of this; pretty much the description of an electronic (laboratory) notebook. We created our own[cite]10.1021/ci500302p[/cite] because few commercial products (either ten years ago, or even now) offer the ability to seamlessly incorporate a Publish workflow which automates all the required actions of RDM as described here, and because it is something we might want to do 5-20 times a day. If your requirement is much less, such automation may not be needed.
  2. When the data is stable and edited down to that which needs to be associated with an article (the narrative), it now needs archiving in a manner that will ensure its persistence for at least a decade or even longer.
  3. Associated metadata describing the data also now needs to be assembled and this combined package is now sent to a data archive. These archives have special characteristics, one of which is that they can issue a persistent identifier we know as the DOI. This itself is issued by a registry, which for data is usefully done by an organisation known as DataCite. If desired, two or more of these packages can be associated with a collection, and the collection itself can also be given a DOI.[cite]10.14469/ch/2[/cite]
  4. A copy of the metadata is sent to DataCite when the DOI is issued. The search engine that indexes this information is also at DataCite.
  5. Now all that needs doing is that the Data DOIs are all cited in the article to be published, or you can (also or instead) cite the DOI for a collection. An accepted article is itself issued in due course with a DOI (this time by an agency known as CrossRef on behalf of the publisher). 
  6. To complete the virtuous cycle, the article DOIs can be retrospectively added to the metadata for each data package (or the collection of packages), ensuring that the data references the narrative, and that the narrative references the data. 
  7. You will note from the virtuous cycle in item 5, that timing becomes important. You have to archive the data and mint a DOI in order to cite it in an article. This sounds like publishing the data before the article has been accepted, which would have the advantage that referees could access it as part of their QA process for the article. However, it may be more suitable to simply reserve a DOI for the data for inclusion in an article, but not make it public until that article has itself been accepted and published. This process is called embargoing; I will defer discussion of this, because this tends to vary according to repository and its implementation is still evolving.
  8. The final action might be to register this activity on any institutional software that monitors and aggregates research outputs. We use Symplectic to achieve this, it having the ability to record both a research publication and increasingly properties of the data itself.
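
Steps 3 to 6 of this life cycle can be sketched as follows. The field names are modelled on my understanding of the JSON form of the DataCite metadata schema (titles, creators, publisher, publicationYear, relatedIdentifiers) and should be checked against the current schema before use; the function names, the DOIs and all the values are placeholders invented for illustration.

```python
import json


def build_record(doi, title, creators, publisher, year):
    """Assemble minimal descriptive metadata for one data package (step 3)."""
    return {
        "doi": doi,
        "titles": [{"title": title}],
        "creators": [{"name": name} for name in creators],
        "publisher": publisher,
        "publicationYear": year,
        "types": {"resourceTypeGeneral": "Dataset"},
        "relatedIdentifiers": [],
    }


def link_to_article(record, article_doi):
    """Add the article DOI to the data record retrospectively (step 6)."""
    link = {
        "relatedIdentifier": article_doi,
        "relatedIdentifierType": "DOI",
        "relationType": "IsSupplementTo",
    }
    # Avoid duplicate links if the update is run more than once.
    if link not in record["relatedIdentifiers"]:
        record["relatedIdentifiers"].append(link)
    return record


# Placeholder values throughout; none of these identify a real deposit.
record = build_record(
    doi="10.5072/example-dataset",
    title="Example data package",
    creators=["A. Researcher"],
    publisher="Example Repository",
    year=2015,
)

# Later, once the article has been accepted and has its own DOI:
link_to_article(record, "10.5072/example-article")

print(json.dumps(record, indent=2))
```

The article cites the data DOI (step 5) and the data record now points back at the article, which is the two-way referencing described in step 6.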

By now you might be asking where you could explore further, and perchance even try things out.

  1. zenodo.org/features  is one good place to start; it will cost nothing; there is (within reason) no limitation to how much data can be archived. Zenodo also allows data to be retrieved from DropBox and Github (for code) for archival.
  2. figshare.com allows you to sign up for free, but with limitations to the total data storage unless you upgrade to an institutional or paid account.
  3. www.datadryad.org/pages/faq  which charges $80-90 per deposition.
  4. Institutional data repositories. The notes above were written based on the experiences we have had for almost nine years now with a local data repository we call SPECTRa,[cite]10.1021/ci7004737[/cite] where some 230,000 individual data packages are now archived. This one[cite]10.14469/ch/46[/cite] dates from 2007 to illustrate its longevity. Unfortunately, only members of  Imperial College can make use of it.

I realise now that I have written this all down that it is somewhat longer than I was expecting, and that this very length may well put some researchers off. Apart from RDM now being mandatory in the UK, it is also reasonable for researchers to ask “what is in it for me?” as a reward for persisting. I can only answer that one from my personal experiences:

  • The live data store (or uportal as we call it) has proved invaluable for recording our (computational) experiments. I often use it to track down calculations from years ago. As a laboratory notebook, it is minimalist, as is the learning curve and hence does not overwhelm. If more information is needed, one simply goes to the DOI recorded there for each experiment if archived, or the original inputs and outputs if not.
  • Assigning a DOI to a data package makes it really easy to share this with both collaborators and other researchers who express interest (the data is often too large to send by email).
  • Sometimes I use e.g. search.labs.datacite.org/help/examples to search the metadata created during the process in order to find (F) and access (A) old data, which is then very quickly amenable to re-use (R). OK, SciFinder or Reaxys it is not (yet!), but it is getting there.
  • One can get access statistics for the data. If you click on the link, you can see some datasets have been accessed more than 200 times. Someone must be finding them valuable! If you want to find out how much (UK) data is searchable in this manner, click here. Perhaps such statistics may even help get you promoted one day!
  • Having data available in this way enables one to construct more interesting tables or figures. This “figable” (yes, it's both a table and a figure) comes from a recent publication of ours.[cite]10.6084/m9.figshare.1299202[/cite] It retrieves the data purely by its DOI and inserts it into display software (JSmol) to construct an instant molecular model. One can also use this approach for lecture notes and labs,[cite]10.1021/ed500398e[/cite] for blogs as here, and (if you are very brave) for research presentations.
  • Google Scholar detects data and citations to it equally with journal articles. This is part of my profile there, and there you can see both articles AND data. If you are keen-eyed, you will however note that the data does not contribute to my h-index (but arguably, it is more valuable to have some data sets accessed 200+ times rather than to be cited!).
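
Several of the points above rely on software being able to search for, or retrieve, information about data using nothing more than a query or a DOI. The Python sketch below only constructs the requests and does not send them. It assumes the public DataCite REST API at api.datacite.org and DOI content negotiation with the DataCite JSON media type, both of which may differ from the labs search page mentioned above and should be verified against the current DataCite documentation; the function names are my own.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Media type requesting DataCite-style JSON metadata via content negotiation.
DATACITE_JSON = "application/vnd.datacite.datacite+json"


def metadata_request(doi):
    """Prepare (but do not send) a request for the metadata behind a DOI."""
    return Request("https://doi.org/" + doi, headers={"Accept": DATACITE_JSON})


def search_url(query, page_size=10):
    """Build a search URL against the DataCite REST API."""
    parameters = {"query": query, "page[size]": page_size}
    return "https://api.datacite.org/dois?" + urlencode(parameters)


request = metadata_request("10.14469/ch/191378")
print(request.full_url)
print(request.get_header("Accept"))
print(search_url("aspirin", page_size=5))

# To actually fetch the metadata one would pass the request to
# urllib.request.urlopen(request) and parse the JSON response.
```

The same DOI therefore serves a human (who lands on a web page) and a program (which asks for structured metadata instead), which is much of what makes the data findable and accessible in practice.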

Some selected use-case examples can be viewed,[cite]10.6084/m9.figshare.1476832[/cite] along with one specific to computational chemistry[cite]10.6084/m9.figshare.1477994[/cite].

4 Responses to “A (light) introductory tutorial on Research Data Management (in chemistry).”

  1. Brian says:

    Thank you for the overview! Obviously this would be rather field-specific, but is there work in progress to define what metadata should be added, what keys should be used for each piece of metadata, and syntax for the metadata fields? For example, when archiving a computational result, are there standard keys for method, functional, basis set, chemical species, etc.?

  2. Henry Rzepa says:


    Yes, there is more to this than an overview could include. Regarding the syntax, that is largely defined by the DataCite schema, and how their API is implemented at the repository end of things. There is also work on meta-metadata schemas, i.e. schemas describing metadata.

    As for the metadata dictionary, the items you list are certainly very finely grained and we are nowhere close to implementing at that level of detail. It is difficult to estimate the size of the dictionary required for computational chemistry, never mind chemistry as a whole. Ideally an organisation such as IUPAC (and its GOLD book for chemistry) should oversee this, but they do move at glacial speeds. I did hear of some NSF initiatives for inter-operability in computational chemistry codes (the I of FAIR), but am not up to date on that. If anyone has any more details, please post.

  3. Henry Rzepa says:

    Re: Marcel Swart (@Marcel_Swart):
    @hrzepa did you already know about the http://iochem-bd.org project? Useful for #compchem outputs

    This project is a successor to an earlier one based on the CML markup for chemistry and on Peter Murray-Rust’s JUMBO, which converts legacy formats into CML. Thanks for pointing it out, Marcel!

    The use of Chemical Markup Language as a semantically rich and “self-describing” language for expressing chemical properties is actually 19 years old now (DOI: 10.1021/ci990052b) and its use continues to slowly increase. Its description and use in the context pointed out by Marcel was something I had pencilled in for the advanced tutorial on RDM that hopefully will appear here in the near future.

  4. Henry Rzepa says:

    An afterthought to Brian’s question about “are there standard keys for method, functional, basis set, chemical species, etc.?”

    The answer to the question about metadata for chemical species is certainly nowadays the InChI string and the InChI key (for discrete molecules, not materials and polymers).

    When one tunnels down to e.g. method, functional, basis set in a quantum mechanical sense one starts to approach the boundary between metadata and data itself. If one starts to e.g. specify basis set or functional in the metadata, one is effectively creating a database of properties for the molecule. This is starting to resemble the third category of granularity/packaging I noted above, where all identifiable properties associated with the subject (the predicates) have an identifiable object (the value of the property). At this sort of level, these constructs are stored in special repositories (sometimes called triple stores) and this is the Wikipedia/Wikidata semantic model I noted very briefly above. At what point one abandons the metadata-data model and goes over to the property model remains up for discussion.

    If we adopt the model that all useful data is described with at least minimal metadata, assigned a DOI and that is sent to DataCite in accordance with their schema, then we at least have a mechanism for preventing too much fragmentation. Whether the community will choose to do this is a different issue.
