New generations of globally aggregating search engines – for (chemical) data.

Chemists have long been familiar with search engines that aspire to index a large proportion of the chemical literature. Think, for example, of the old-generation (and commercial) SciFinder (Scholar) and Reaxys, or of those that arrived in the 1990s in the online era, such as the non-commercial PubChem or ChemSpider (there are more). But you may not be as familiar with the latest generation of global search engines, and here I will focus on three relatively new ones that specialise in tracking down data rather than just publications.

I will illustrate first using a regular or non-advanced search. The keyword will be obtusallene, selected largely because it is a fairly distinctive string that is likely to produce fewer false positives. The obtusallenes are a family of marine alkaloids containing, unusually, bromine and/or chlorine[1]; the citation here is to a journal article describing some of their chemistry. But what if you want to find data associated with such molecules?

  1. DataCite (the name gives a clue) specialises in finding data. It was launched ten years ago and has been rapidly expanding its index since. A regular search can be formulated using the string

    As these three advanced queries imply, there are many more ways of constraining the search, which I will describe at a later time.

  2. A more recent introduction is DataSetSearch from Google.
    • (20 hits). Google cites as its sources DataCite itself and the specific repository Figshare (for this search query). 
    • Which leaves a slight mystery. Although the DataCite and Google searches overlap considerably, Google lists DataCite among its sources, so its results might be expected to be a superset of the DataCite ones. In fact the Google search is slightly less comprehensive (by at least 5 hits).
  3. My third new engine is OpenAIRE (a European project supporting Open Science). It is also the search engine provided by Zenodo.
    • (20 hits on research data, 6 hits on publications, 5 hits on “other research products” and zero hits on “software”).
    • Which introduces not just data but other concepts associated with “research objects”, clearly more useful than data alone. One of these may well shortly be instruments (e.g. as used to acquire data) and another is the software used to analyse the data.
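
The kind of keyword search described for DataCite above can also be run programmatically. What follows is only a minimal sketch, assuming the public DataCite REST API at https://api.datacite.org/dois with its `query` and `page[size]` parameters and a JSON response carrying `meta.total` and a `data` list of records whose `id` is the DOI. The function names are my own for illustration, and the DOIs in the usage note are invented placeholders, not real search results. The `missing_from` helper is one simple way to probe the superset question raised for the Google search, given two lists of DOIs however obtained.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the public DataCite REST API.
DATACITE_API = "https://api.datacite.org/dois"


def build_query_url(keyword, page_size=25):
    """Return a DataCite API URL for a plain (non-advanced) keyword search."""
    params = {"query": keyword, "page[size]": page_size}
    return DATACITE_API + "?" + urlencode(params)


def summarise(response):
    """Return (total hit count, DOIs on this page) from a parsed response."""
    total = response.get("meta", {}).get("total", 0)
    # DOIs are case-insensitive, so normalise them before any comparison.
    dois = [record["id"].lower() for record in response.get("data", [])]
    return total, dois


def missing_from(reference_dois, other_dois):
    """DOIs found in the reference result set but absent from the other."""
    return sorted(set(reference_dois) - set(other_dois))


def search(keyword):
    """Run the search over the network and summarise the first page of hits."""
    with urlopen(build_query_url(keyword)) as handle:
        return summarise(json.load(handle))
```

Calling `search("obtusallene")` would fetch the first page of hits, and `missing_from(["10.1234/a", "10.1234/b"], ["10.1234/b"])` lists the (placeholder) DOIs that one engine found and the other did not.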

I think these new-generation search engines specialising in data have lots of exciting potential. They are still maturing and I hope we will see some interesting new capabilities emerge which we have not had before.

All are on-line nowadays, but engines such as SciFinder had two previous existences, from about 1980 as CAS online using merely a terminal interface, and prior to that as printed copies to be searched manually.


  1. J. Clarke, K.J. Bonney, M. Yaqoob, S. Solanki, H.S. Rzepa, A.J.P. White, D.S. Millan, and D.C. Braddock, "Epimeric Face-Selective Oxidations and Diastereodivergent Transannular Oxonium Ion Formation Fragmentations: Computational Modeling and Total Syntheses of 12-Epoxyobtusallene IV, 12-Epoxyobtusallene II, Obtusallene X, Marilzabicycloallene C, and Marilzabicycloallene D", The Journal of Organic Chemistry, vol. 81, pp. 9539-9552, 2016.

One Response to “New generations of globally aggregating search engines – for (chemical) data.”

  1. Henry Rzepa says:

    A well-hidden secret for some search engines at least is what is rather intimidatingly referred to as advanced search. Thus with Google, you have and also search operators (described at ) which enhance the regular searches for websites. This has been joined by (for images) where you can control fields such as Size, Aspect ratio, Color, Type (face, animated, etc.), Site or domain, Filetype, SafeSearch and Usage rights (find images that you have permission to use). Some of this latter category might also come in useful for data.

    I have asked Google if such an advanced version of their data search might exist, to match the equivalent searches possible at DataCite.
