Way back in the late 1980s or so, research groups in chemistry started to replace their paper-based filing of research data with storage in an easily retrievable digital form. This required a computer database, initially accessible only on specific dedicated computers in the laboratory. From the 1990s onwards these gradually became accessible online, so that more than one person could use them from different locations. At least where I worked, the infrastructures‡ needed to set up such databases were mostly not then available as part of the standard research provision and so had to be installed and maintained by the group itself. The database software took many different forms and it was not uncommon for each group in a department to come up with a different solution that best suited its needs. The result was a proliferation of largely non-interoperable solutions which did not communicate with each other: each database had to be searched locally, and there could be ten or more such resources in a department. The knowledge of how a system operated also often resided in just one person, and tended to evaporate when that guru left the group.
After the millennium, two newcomers started to appear: one called an ELN (electronic laboratory notebook) and the second a data repository. The first was a heavily customised database containing research data as obtained from instruments, computers, images/video, chemical structure drawings etc. ELNs, even to this day, have limited interoperability with other ELNs, and the contents of an ELN are often closed, requiring authentication credentials to access. The data repository also started to appear in chemistry around this period. Even in its early incarnations, it could be associated with an ELN “front end” as part of the data pipeline; an early example of this coupling is described here.[1] Another key phrase that became associated with repositories from around 2014 was the concept of FAIR, including ideas such as the Findability (discoverability) and Interoperability of data,† a theme often explored and illustrated on this blog.
The last seventeen years have seen organisations such as funding agencies and publishers increasingly mandating such data management methods, using either a repository on its own or a combination of an ELN and repository as a routine part of research activity and of publication processes. The close coupling of an ELN and a repository is, however, still uncommon.
A colleague recently alerted me to a computational chemistry repository first launched in 2014, www.iochem-bd.org. Reading the “about” text there, I found these statements:
So this is both a database and a data repository, and it also espouses a commendable common data standard![2] I decided to explore the first two aspects here, using this resource as an example.
This example has served to highlight just a few of the often quite subtle distinctions between, e.g., a database and a data repository, and to show that some resources can indeed be both. It also highlights that repositories can have the attributes of FAIR, which are themselves driven by asking “what could a machine do to obtain data?”♥ rather than what a human could achieve by browsing. So another question that arises when evaluating the characteristics of a repository is whether each item held there has a FAIR-enabling metadata record describing the data, a record registered in a manner that can be aggregated and hence used to find and access content across multiple independent repositories.
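As a minimal illustration of what such a machine-actionable metadata record enables, the sketch below builds a standard DOI content-negotiation request, asking doi.org for the DataCite JSON metadata of a dataset DOI rather than its human landing page. The DOI is the one cited in the footnote below; the function name and surrounding code are my own, not part of any repository's API.

```python
import urllib.request

DATACITE_JSON = "application/vnd.datacite.datacite+json"

def metadata_request(doi: str) -> urllib.request.Request:
    """Build a DOI content-negotiation request for DataCite JSON metadata.

    Resolving https://doi.org/<doi> with this Accept header returns the
    registered metadata record instead of the HTML landing page, which is
    what makes the record machine-actionable.
    """
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": DATACITE_JSON},
    )

req = metadata_request("10.14469/hpc/10059")
print(req.full_url)              # https://doi.org/10.14469/hpc/10059
print(req.get_header("Accept"))  # application/vnd.datacite.datacite+json
```

Opening the built request (e.g. with `urllib.request.urlopen`) would return the metadata as JSON, ready for parsing by whatever tool needs it.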
This post has DOI 10.14469/hpc/10043
‡Indeed in that era, few online/Internet infrastructures were available as part of departmental resources. See also here.

†In this last regard, I note a workshop devoted largely to such interoperability and machine access in chemistry coming up soon: https://www.cecam.org/workshop-details/1165

♠The CrossRef schema is not referenced using an assigned DOI: data.crossref.org/reports/help/schema_doc/5.3.1/

♥An example can be seen at DOI: 10.14469/hpc/10059. Here, invoking a hyperlink based purely on the data DOI and the required data media type in turn calls code (Javascript) which retrieves the metadata held for that DOI and parses it to identify whether it indicates the presence of a file manifest. If it does, the code identifies the type of manifest (ORE in this case) and the media types the manifest points to, and finally uses that manifest to retrieve data filtered by media type and pipe it into a visualiser (JSmol). In this case the endpoint is visualisation, but the data could equally be piped into, e.g., an AI/ML program for analysis. Here only one instance of data is machine-retrieved, but in principle it could be a multitude of data files obtained from a multitude of different locations, filtered by suitable searches of registered metadata.[4]
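The ♥ workflow can be sketched in outline. The manifest structure below is a simplified stand-in (a real ORE resource map is considerably richer), and the function and example URLs are mine; the point is only the core logic: given a manifest discovered via the metadata, filter its entries by media type before retrieving anything.

```python
# Simplified stand-in for an ORE-style file manifest as described above:
# a list of resources, each with a location and a media type. This
# dictionary shape is illustrative only, not the real ORE schema.
EXAMPLE_MANIFEST = {
    "type": "ORE",
    "resources": [
        {"url": "https://example.org/calc/opt.log", "mediaType": "text/plain"},
        {"url": "https://example.org/calc/opt.cml", "mediaType": "chemical/x-cml"},
        {"url": "https://example.org/calc/opt.xyz", "mediaType": "chemical/x-xyz"},
    ],
}

def select_by_media_type(manifest: dict, media_type: str) -> list[str]:
    """Return the URLs of manifest entries matching the requested media type.

    A downstream consumer (a visualiser such as JSmol, or an ML pipeline)
    would then fetch only these URLs, never touching the landing page.
    """
    return [
        r["url"]
        for r in manifest.get("resources", [])
        if r.get("mediaType") == media_type
    ]

print(select_by_media_type(EXAMPLE_MANIFEST, "chemical/x-cml"))
```

The same filter applied across manifests aggregated from many repositories is what turns one machine-retrieved file into the "multitude of data files" envisaged above.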