Tom recently emailed me this question: Do you know how to find out how many of the compounds that appear in the chemical literature are mentioned just once? Intrigued, I first set out to find out how many substances, as Chemical Abstracts refers to them, there were as of 5 June 2025. There is a static estimate here (219 million), but to get the most up-to-date information, I asked CAS directly. They responded immediately (thanks Lee!) with 294,778,693 on the date mentioned above. It is not actually possible to answer Tom's question itself using CAS SciFinder, but again CAS came up with a value: “there are 113,383,649 substances in CAS Registry with only one CAplus citation”, equivalent to “38.5% of the current substances have only 1 reference.” I should add that this estimate was qualified by “that can be misleading, since that includes salts, multicomponents, etc. But that’s a first pass.” I am actually impressed that as many as 61.5% are mentioned more than once, since before learning the answer I had intuitively guessed that percentage to be much lower.
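As a quick check on the arithmetic, the two counts quoted by CAS can be turned back into the percentages given above. This is only a sketch; the variable names are my own and the only inputs are the two figures from the post.

```python
# Counts quoted by CAS for 5 June 2025 (taken from the text above).
total_substances = 294_778_693   # all substances in CAS Registry
cited_once = 113_383_649         # substances with only one CAplus citation

# Fraction with a single citation, and the remainder.
share_once = cited_once / total_substances
share_rest = 1 - share_once

print(f"only one citation: {share_once:.1%}")   # 38.5%
print(f"remainder:         {share_rest:.1%}")   # 61.5%
```

The remainder is simply everything that is not in the "only one citation" group, so it inherits the same caveat about salts and multicomponent substances.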
Archive for the ‘Chemical IT’ Category
How many of the compounds that appear in the chemical literature are mentioned just once?
Friday, June 6th, 2025
Referencing and citing a science-based blog post.
Tuesday, April 8th, 2025
Back in early 2012, I pondered the relationship between a science-based blog post and a science-based journal article.[1] This was in part induced by my discovering a blog plugin called Kcite, which allows journal articles to be appended to the blog in the form of a numbered reference list. The only required input for Kcite was the DOI of the article (as you can see earlier in this paragraph). For around 500 posts after that moment, I always strove to add such references to my posts. Around 2016, I started including references to data in the form of repository DOIs to sit alongside the journal references, but this feature stopped working a year or two later because of changes in the metadata resolved by the DOI. Kcite itself lasted until January 2024 for this blog, when a required update to the software running the blog (WordPress) meant that it no longer worked and had to be removed as a plugin. Two years ago, Rogue Scholar (Science blogging on steroids) started coming along to the rescue.[2],[3] It provides some amazing automated features and infrastructure to blogs; I will illustrate from those listed on the top page of Rogue Scholar itself: (more…)
References
- H. Rzepa, "The blog post as a scientific article: citation management", 2012. https://doi.org/10.59350/3pbz1-vcd67
- M. Fenner, "Automatically list all your publications in your blog", 2013. https://doi.org/10.53731/axtz227-73n18e7
- M. Fenner, "Rogue Scholar now shows citations of science blog posts", 2025. https://doi.org/10.53731/4bvt3-hmd07
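To illustrate how a DOI alone can be enough to produce a formatted reference, here is a minimal sketch using DOI content negotiation, in which the `https://doi.org/` resolver is asked for a formatted citation via the `Accept` header. This is not a description of how Kcite or Rogue Scholar work internally; the function names and the choice of the `apa` style are my own assumptions.

```python
import urllib.parse
import urllib.request


def build_citation_request(doi: str, style: str = "apa") -> urllib.request.Request:
    """Build (but do not send) a request asking the DOI resolver for a formatted citation."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    # text/x-bibliography asks the registration agency for a ready-formatted reference.
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return urllib.request.Request(url, headers=headers)


def fetch_citation(doi: str, style: str = "apa", timeout: float = 10.0) -> str:
    """Send the request and return the citation text (needs network access)."""
    request = build_citation_request(doi, style)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


# Build the request for the first reference in the list above, without sending it.
example = build_citation_request("10.59350/3pbz1-vcd67")
print(example.full_url)
```

Calling `fetch_citation("10.59350/3pbz1-vcd67")` would then perform the lookup. What comes back depends on the metadata registered for that DOI, which is exactly the dependency that caused the data references mentioned above to stop working.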
Crystallography meets DFT Quantum modelling.
Monday, March 17th, 2025
X-ray crystallography is the technique of using the diffraction of X-rays by the electrons in a molecule to determine the positions of all the atoms in that molecule. Quantum theory teaches us that the electrons are to be found in shells around the atomic nuclei. There are two broad types, the outermost shell (also called the valence shell) and all the inner or core shells. The density of the core electrons is much higher (more compact) than that of the more diffuse valence shell for all but the hydrogen atom, which only has valence electrons. How does this relate to X-ray diffraction by electrons? Well, core electrons, because of their relative compactness, diffract X-rays more strongly than the valence electrons. This compactness of the core also means that its electron density distribution can be well (but not exactly) approximated by a sphere, with the nucleus at the centre of that sphere. And from this it follows that the density for each atom can be treated independently, the so-called IAM or independent atom model. For example, all the carbon atoms in a molecule are approximated as having the same value for the electron density of their core shell. But the IAM approximation is much poorer for hydrogen atoms, especially when they are attached to very polar atoms (Li, O, F, etc.), and even atoms such as carbon or oxygen show noticeable deviations, as illustrated in figure 1 below.[1]
References
- F. Kleemiss, O.V. Dolomanov, M. Bodensteiner, N. Peyerimhoff, L. Midgley, L.J. Bourhis, A. Genoni, L.A. Malaspina, D. Jayatilaka, J.L. Spencer, F. White, B. Grundkötter-Stock, S. Steinhauer, D. Lentz, H. Puschmann, and S. Grabowsky, "Accurate crystal structures and chemical properties from NoSpherA2", Chemical Science, vol. 12, pp. 1675-1692, 2021. https://doi.org/10.1039/d0sc05526c
Finding and Discovery Aids as part of data availability statements for research articles.
Wednesday, February 19th, 2025
Around 2016, journal publishers started including mandatory “Data Availability” statements as part of research articles; a typical (dated) example is linked here, including guidelines for how to cite the data itself. I wrote about these aspects last year in a blog post for the RSC journal Digital Discovery[1] and here I follow up with more news.
References
- H. Rzepa, "The evolving roles of data and citations in journal articles", 2024. https://doi.org/10.26434/chemrxiv-2024-dz2dv
The secrets of FAIR Metadata: optimisation for Chemical Compounds.
Wednesday, December 11th, 2024
The idea of so-called FAIR (Findable, Accessible, Interoperable and Reusable) data is that each object has an associated metadata record which serves to enable the four aspects of FAIR. Each such record is itself identified by a persistent identifier known as a DOI. The trick in producing useful FAIR data is defining what might be termed the “granularity” of the data objects, so that they are the most readily findable and most usefully enable the other three attributes of FAIR.
Raw data and the evolution of crystallographic FAIR data. Journals, processed and raw structure data.
Monday, March 28th, 2022
In my previous post on the topic, I introduced the concept that data can come in several forms, most commonly as “raw” or primary data and as a “processed” version of this data that has added value. In crystallography, the chemist is interested in this processed version, carried by a CIF file. However, on the rare occasions when a query arises about the processed component, this can in principle at least be resolved by taking a look at the original raw data, expressed as diffraction images. I established, with much appreciated help from CCDC, that since 2016 around 65 datasets in the CSD (Cambridge Structural Database) have appeared with such associated raw data. The problem is how to reconcile the two sets of data easily (the raw data is not stored on CSD), and one way of doing this is via the metadata associated with the datasets. In turn, if this metadata is suitably registered, one can query the metadata store for such associations, as was illustrated in the previous post on the topic. Here I explore the metadata records for five of these 65 sets to find out their properties, selected to illustrate the five data repositories that thus far host such data for compounds in the CSD.
Raw data: the evolution of FAIR data and crystallography.
Tuesday, March 1st, 2022
Scientific data in chemistry has come a long way in the last few decades. Originally entangled into scientific articles in the form of tables of numbers or diagrams, it was (partially) disentangled into supporting information when journals became electronic in the late 1990s.[1] The next phase was the introduction of data repositories in the early noughties. Now associated with innovative commercial companies such as Figshare and later the non-commercial Zenodo, such repositories have also spread to institutional form, such as the earlier SPECTRa project of 2006,[2] and are still evolving.[3] Perhaps the best known, and certainly one of the oldest, examples of curated structural data in chemistry is the CSD (Cambridge Structural Database) of the CCDC (Cambridge Crystallographic Data Centre), which has been operating for more than 55 years now, since before the online era! Curation here is the important context, since there you will find crystal diffraction data which has been refined into a structural model, firstly by the authors reporting the structure and then by CSD who, amongst other operations, validate the associated data using a utility called CheckCIF.[4] What perhaps is not realised by most users of this data source is that the original or “raw” data, as obtained from an X-ray diffractometer and from which the CSD data is derived, is not actually available from the CSD. This primary form of crystallographic data is the topic of this post.
References
- A.M. Hunter and A.B. Smith, "Review of Supporting Information at Organic Letters", Organic Letters, vol. 17, pp. 2867-2869, 2015. https://doi.org/10.1021/acs.orglett.5b01700
- J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
- M.J. Harvey, A. McLean, and H.S. Rzepa, "A metadata-driven approach to data repository design", Journal of Cheminformatics, vol. 9, 2017. https://doi.org/10.1186/s13321-017-0190-6
- A.L. Spek, "Structure validation in chemical crystallography", Acta Crystallographica Section D Biological Crystallography, vol. 65, pp. 148-155, 2009. https://doi.org/10.1107/s090744490804362x
Data base or Data repository? – A brief and very selective history of data management in chemistry.
Wednesday, January 26th, 2022
Way back in the late 1980s or so, research groups in chemistry started to replace the filing of their paper-based research data by storing it in an easily retrievable digital form. This required a computer database and initially these were accessible only on specific dedicated computers in the laboratory. These gradually changed from the 1990s onwards into being accessible online, so that more than one person could use them in different locations. At least where I worked, the infrastructure to set up such databases was mostly not then available as part of the standard research provisions and so had to be installed and maintained by the group itself. The database software took many different forms and it was not uncommon for each group in a department to come up with a different solution that suited its needs best. The result was a proliferation of largely non-interoperable solutions which did not communicate with each other. Each database had to be searched locally and there could be ten or more such resources in a department. The knowledge of how the system operated also often resided in just one person, and tended to evaporate when this guru left the group.
Quantum chemistry interoperability (library): another step towards FAIR data.
Saturday, January 1st, 2022
To be FAIR, data has to be not only Findable and Accessible, but straightforwardly Interoperable. One of the best examples of interoperability in chemistry comes from the domain of quantum chemistry. This strives to describe a molecule by its electron density distribution, from which many interesting properties can then be computed. The process is split into two parts:
First came Molnupiravir – now there is Paxlovid as a SARS-CoV-2 protease inhibitor. An NCI analysis of the ligand.
Saturday, November 13th, 2021
Earlier this year, Molnupiravir hit the headlines as a promising antiviral drug. This is now followed by Paxlovid, which is the first small molecule to be aimed by design at the SARS-CoV-2 protein and which is reported as greatly reducing the risk of hospitalization or death when given within three days of symptoms appearing in high-risk patients.