New generations of globally aggregating search engines – for (chemical) data.

Chemists have long been familiar with search engines that aspire to index a large proportion of the chemical literature. Think, for example, of the old-generation (and commercial) SciFinder (Scholar) and Reaxys, or those that arrived in the 2000s in the online era such as the non-commercial PubChem or ChemSpider (there are more). But you may not be as familiar with the latest generation of global search engines, and here I will focus on three relatively new ones that specialise in tracking down data rather than just publications.

I will illustrate first using a regular or non-advanced search. The keyword will be obtusallene, selected largely because it is a fairly distinctive string that is likely to produce few false positives. The obtusallenes are a family of marine natural products containing, unusually, bromine and/or chlorine[1]; the citation here is to a journal article describing some of their chemistry. But what if you want to find data associated with such molecules?

  1. DataCite (the name gives a clue) specialises in finding data. It was launched ten years ago and has been rapidly expanding its index since. A regular search can be formulated using the string

    As these three advanced queries imply, there are many more ways of constraining the search, which I will describe at a later time.

  2. A more recent introduction is Dataset Search from Google.
    • https://datasetsearch.research.google.com/search?query=obtusallene (20 hits). Google cites as its sources DataCite itself and the specific repository Figshare (for this search query). 
    • Which leaves a slight mystery. Whilst there is considerable overlap between the DataCite and Google searches, the latter might be expected to be a superset of the former, since it cites DataCite as one of its sources, but in fact it is slightly less comprehensive (by at least 5 hits).
  3. My third new engine is OpenAIRE (a European project supporting Open Science). It is also the search engine provided by Zenodo.
    • https://explore.openaire.eu/search/find?keyword=obtusallene (20 hits on research data, 6 hits on publications, 5 hits on “other research products” and zero hits on “software”).
    • This introduces not just data but other kinds of “research objects”, which is clearly more useful than data alone. One of these may well shortly be instruments (e.g. those used to acquire the data) and another is the software used to analyse the data.
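
The two search links quoted above differ only in their base address and in the name of the query parameter, so the same keyword search is easy to repeat for other molecules. What follows is a minimal sketch in Python, assuming only the URL patterns exactly as they appear in those links; the function name build_search_urls and the dictionary SEARCH_ENDPOINTS are my own illustrative choices and are not part of either service. It only constructs the links and does not attempt to retrieve or count the hits.

```python
from urllib.parse import urlencode

# Base address and query-parameter name for each engine, copied from the
# two links quoted in the list above. Both services may change these.
SEARCH_ENDPOINTS = {
    "Google Dataset Search": ("https://datasetsearch.research.google.com/search", "query"),
    "OpenAIRE Explore": ("https://explore.openaire.eu/search/find", "keyword"),
}


def build_search_urls(term):
    """Return a dict mapping engine name to a simple keyword-search URL."""
    return {
        name: f"{base}?{urlencode({param: term})}"  # urlencode escapes spaces etc.
        for name, (base, param) in SEARCH_ENDPOINTS.items()
    }


if __name__ == "__main__":
    for name, url in build_search_urls("obtusallene").items():
        print(f"{name}: {url}")
```

Pasting the printed links into a browser should reproduce the searches described above, although the hit counts will of course drift as the indexes grow.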

I think these new-generation search engines specialising in data have lots of exciting potential. They are still maturing and I hope we will see some interesting new capabilities emerge which we have not had before.


All are online nowadays, but engines such as SciFinder had two previous existences: from about 1980 as CAS Online, accessed through a plain terminal interface, and before that as printed copies to be searched manually.

References

  1. J. Clarke, K.J. Bonney, M. Yaqoob, S. Solanki, H.S. Rzepa, A.J.P. White, D.S. Millan, and D.C. Braddock, "Epimeric Face-Selective Oxidations and Diastereodivergent Transannular Oxonium Ion Formation Fragmentations: Computational Modeling and Total Syntheses of 12-Epoxyobtusallene IV, 12-Epoxyobtusallene II, Obtusallene X, Marilzabicycloallene C, and Marilzabicycloallene D", The Journal of Organic Chemistry, vol. 81, pp. 9539-9552, 2016. http://dx.doi.org/10.1021/acs.joc.6b02008
Henry Rzepa

Henry Rzepa is Emeritus Professor of Computational Chemistry at Imperial College London.
