Accessing (raw) chemical data: a peek into the CIF format.

There is much focus at the moment on how to ensure experimental replicability in e.g. the molecular sciences. An important aspect of that is having access to FAIR data: data which is findable, accessible, interoperable and re-usable. One of the “gold standards” in chemistry is the data associated with crystal structures. Here I take an inside peek into the standard file type for carrying crystal structure data, the CIF (Crystallographic Information File).

CIF is a tightly managed format, with utility tools such as checkCIF to validate the files and check for errors. It is also what is called a processed data format, created from structural analysis of the raw image data that emerges from a diffractometer, and is therefore what might be described as a lossy format. Discussing these aspects with our crystallographer here (thanks Andrew!), I began to realise that there are at least three distinctly different versions of a CIF file, each carrying a different degree of data loss.

As my illustration of this, I am going to take a structure[1] known by three different identifiers: AZUJOW, CCDC 1406199 or DOI: 10.5517/ccdc.csd.cc1j6888

The CIF originates with the authors and this version is 449KB in size. I have deposited it and the other two at DOI: 10.14469/hpc/2752 for you to inspect and compare them. This file is relatively large since it contains the so-called structure factors or hkl information, a snippet of which looks like:
```
_shelx_hkl_file 
; 
   0   0   1 108882. 1066.19   2 
   0   0   2 320.055 130.609   2 
   0   0   3 18538.0 806.608   2 
   0   0   4 173192. 2808.03   2 
```
This information is removed using a utility known as shredcif to produce a second version, the name_x.cif version, which reduces the size to 27KB. This version retains properties such as the thermal ellipsoids and the bond lengths and angles, but loses the hkl information.
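
For anyone wanting to look inside the hkl block before it is shredded, here is a minimal Python sketch. It assumes the records follow the fixed-width SHELX HKLF 4 layout (three 4-character integers h, k, l, then two 8-character numbers for the intensity and its standard uncertainty, then an optional 4-character batch number) and that the list ends at a 0 0 0 record. The function names are purely illustrative, and other refinement programs may store reflections quite differently.

```python
def parse_hkl_line(line):
    """Parse one fixed-width reflection record (assumed HKLF 4 layout)."""
    h, k, l = (int(line[i:i + 4]) for i in (0, 4, 8))
    intensity = float(line[12:20])
    sigma = float(line[20:28])
    batch_field = line[28:32].strip()
    batch = int(batch_field) if batch_field else None
    return (h, k, l), intensity, sigma, batch


def extract_hkl_records(cif_text):
    """Return the reflections found in the _shelx_hkl_file text field."""
    records = []
    seen_tag = False   # have we reached the _shelx_hkl_file data name?
    inside = False     # are we between the two ';' delimiter lines?
    for line in cif_text.splitlines():
        if not seen_tag:
            seen_tag = line.strip() == "_shelx_hkl_file"
            continue
        if line.startswith(";"):
            if inside:
                break          # closing ';' of the text field
            inside = True      # opening ';' of the text field
            continue
        if inside and line.strip():
            record = parse_hkl_line(line)
            if record[0] == (0, 0, 0):
                break          # assumed end-of-reflections marker
            records.append(record)
    return records
```

Calling `extract_hkl_records(open("JWE1401b.cif").read())` on the author version should then give one tuple per reflection; on the other two versions it simply returns an empty list.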
After the CIF is submitted to CSD, it emerges as AZUJOW.cif, which is now just 7KB in size and is now missing the bond lengths and angles etc.

The original raw image data for this structure is not publicly available, but you can see a set of structures for which it IS available at DOI: 10.14469/hpc/2297 (published as [2]), where the file sizes are typically 200-600 MB (they can get much larger).

So a CIF can vary in data content between 7 and 449KB, and the original “raw” data can be ten thousand times larger than this! To acquire all the flavours, you have to both access the CSD and contact the original authors (unless of course the latter have deposited their versions in an open data repository, as above).
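
If you are handed a CIF and want to know which of the three flavours it is, a rough test is to look for the data names that each step removes. The sketch below is only a heuristic under stated assumptions: it treats the presence of `_shelx_hkl_file` as the sign of the full author version, and the presence of `_geom_bond_distance` (the CIF core data name for a bond length) as the sign of the shredded version. The function name and the three labels are made up for illustration, and CIFs written by programs other than SHELX may carry their reflections under different data names.

```python
def classify_cif(cif_text):
    """Guess which flavour of CIF this is from the data names it contains."""
    has_hkl = "_shelx_hkl_file" in cif_text
    has_geometry = "_geom_bond_distance" in cif_text
    if has_hkl:
        return "full"      # author version, still carrying hkl data
    if has_geometry:
        return "shredded"  # hkl removed, bond lengths and angles kept
    return "minimal"       # neither present, as in the version from CSD
```

A plain substring test is crude (it would also match the name inside a comment), but it is usually enough for a quick check of a handful of files.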

Fortunately for most chemical applications, even the “lossiest” of the CIF formats is more than adequate. But for the gold standard in chemical data, you should be aware that you may still be losing access to a lot of original data in the CIF formats and of course to all of the raw diffractometer data. I think it fair to say however that there is now momentum to increasingly retain as much of this data as is possible for posterity.

Author

Henry Rzepa

Henry Rzepa is Emeritus Professor of Computational Chemistry at Imperial College London.

References

[1] A. Toscani, K.A. Jantan, J.B. Hena, J.A. Robson, E.J. Parmenter, V. Fiorini, A.J.P. White, S. Stagni, and J.D.E.T. Wilton-Ely, "The stepwise generation of multimetallic complexes based on a vinylbipyridine linkage and their photophysical properties", Dalton Transactions, vol. 46, pp. 5558-5570, 2017. https://doi.org/10.1039/c6dt03810g
[2] J. Almond-Thynne, A.J.P. White, A. Polyzos, H.S. Rzepa, P.J. Parsons, and A.G.M. Barrett, "Synthesis and Reactions of Benzannulated Spiroaminals: Tetrahydrospirobiquinolines", ACS Omega, vol. 2, pp. 3241-3249, 2017. https://doi.org/10.1021/acsomega.7b00482

This entry was posted on Friday, July 21st, 2017 at 8:21 am and is filed under Chemical IT.

7 Responses to “Accessing (raw) chemical data: a peek into the CIF format.”

Henry Rzepa says:

July 23, 2017 at 8:26 am

One of the standard tests for data loss is the “round tripping” operation. Most frequently applied to e.g. Chemdraw structures (across platforms, or in and out of e.g. Microsoft Office), it can be applied to crystallographic information using e.g. the CCDC Mercury program. I thought I would apply it to the three CIF structures described above.
1. The original JWE1401b.cif (449,200 bytes) is reduced to 5,465 bytes by Mercury.
2. The “shredded” JWE1401b_x.cif (26,671 bytes) file is likewise reduced to 5,465 bytes.
3. The version retrieved from CSD (6,747 bytes) is reduced slightly less to 5,540 bytes.

So Mercury is a “round trip” lossy procedure, and you should be aware that passage through Mercury will discard information. It would be nice if Mercury had a flag controlling any information loss, in the manner that many e.g. JPEG image programs allow the degree of information loss to be specified.
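
File sizes only tell you that something was lost, not what. One way to find out is to compare the data names present before and after the round trip. The following is a minimal sketch of that idea, assuming a CIF in which data names start a line (possibly indented) and multi-line text fields are delimited by lines beginning with a semicolon. The function names are illustrative only, and it is not a full CIF parser.

```python
import re


def data_names(cif_text):
    """Collect the data names (tags starting with '_') used in a CIF."""
    names = set()
    in_text_field = False
    for line in cif_text.splitlines():
        if line.startswith(";"):
            # A ';' in column one opens or closes a multi-line text field.
            in_text_field = not in_text_field
            continue
        if in_text_field:
            continue  # ignore anything embedded in a text field
        match = re.match(r"\s*(_\S+)", line)
        if match:
            names.add(match.group(1).lower())  # data names are case-insensitive
    return names


def lost_in_round_trip(before_text, after_text):
    """Return the data names present before the round trip but not after."""
    return sorted(data_names(before_text) - data_names(after_text))
```

Feeding it the text of a file before and after it has passed through a program lists exactly which items that program chose not to write back out.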

Armel Le Bail says:

August 8, 2017 at 5:49 pm

The Crystallography Open Database offers the possibility to store CIFs including interatomic distances and angles as well as Fobs and Fcalc values, unlike the CSD. Moreover, the data are gold open access, unlike the CSD and ICSD. The question is why crystallographers send their data to the CSD but so rarely to the COD.

Henry Rzepa says:

August 9, 2017 at 1:34 pm

Having that information already available is good to know.

The reason I do not use COD more regularly is that the search tools require you to learn how to make SQL queries directly of the database using a command line, and then analyse the statistics using a separate package such as R. Learning SQL in order to search for crystal structures, and then another language to analyse the statistics, is too high a barrier for most, and I am no exception. If a good GUI comparable to e.g. ConQuest were available, I am sure COD would get more use.
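
To give a flavour of that two-step workflow (an SQL query, then a separate statistical analysis), here is a toy sketch. Everything in it is an assumption made for illustration: the table `structures` and its columns are hypothetical and are not the COD schema, the three rows are invented, an in-memory SQLite database stands in for the real MySQL server, and Python's `statistics` module stands in for R.

```python
import sqlite3
import statistics


def mean_volume(connection, formula_pattern):
    """Step 1: select by SQL. Step 2: summarise the hits outside the database."""
    rows = connection.execute(
        "SELECT cell_volume FROM structures WHERE formula LIKE ?",
        (formula_pattern,),
    ).fetchall()
    volumes = [row[0] for row in rows]
    mean = statistics.mean(volumes) if volumes else None
    return len(volumes), mean


# A hypothetical three-row table, purely to make the sketch runnable.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE structures (id INTEGER PRIMARY KEY, formula TEXT, cell_volume REAL)"
)
connection.executemany(
    "INSERT INTO structures (formula, cell_volume) VALUES (?, ?)",
    [("C6 H6", 500.0), ("C6 H12 O6", 700.0), ("Na Cl", 180.0)],
)

count, mean = mean_volume(connection, "C6 %")
print(count, mean)
```

Even in this trivial form, the searcher has to know the table layout, the SQL syntax and a second tool for the statistics, which is the barrier being described.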

Also, to be fair, COD has only around half the entries of CSD. If you are doing a search prior to writing a paper, you want to be as sure as you can be that you are not missing some vital structure. Since CSD is more likely to be comprehensive, you are going to use it at least in the first instance.

Finally, all entries in CSD now have an assigned DOI. The advantage there is that it exposes metadata about the structure in a standard way, which allows future generations of tools to make automated connections either between diverse structures or, more importantly, between structures and articles (the so-called Event Data CrossRef/DataCite project). It is good to know, for example, who is citing a particular structure in their articles.

- Bob McMeeking says:
  
  August 9, 2017 at 6:53 pm
  
  I think your remarks about having to use SQL to search the COD are a bit unfair/outdated. The Crystallography Open Database home site – http://www.crystallography.net/cod/ – does support a form-based search facility (including JSME/JCP for a subset of entries). Where is the need for SQL knowledge there? There is also a display option using Jmol/JSmol.
  
  It is indeed true that systems such as ConQuest have a number of very advanced search and analysis options (essential to some, but to what extent they are fully used by the average user is a moot point), and are an essential resource.
  
  The question of the number of entries is an issue. Perhaps the crystallography community should really get its act together and work towards having the open software tools available and fostering a culture of submitting structures to the likes of COD as well as the major commercial databases?
  
  Anyone can, of course, download the COD for their own purposes. I write as someone with a special interest here. We download COD updates every night and these are incorporated into the CrystalWorks systems at Daresbury immediately. COD in CrystalWorks is searchable by anyone. As a UK academic user, you can of course search the CSD, ICSD & CRYSTMET simultaneously via CrystalWorks using your CDS/DL username or via the Royal Society of Chemistry National Chemical Database Service portal.
  
  In your original Blog you discuss the Imperial College Research Data Repository. I am sure there are many others. These could (should?), of course, automatically feed into the COD system.
  
  Also I have memories of the eCrystals initiative. I sometimes wonder whether this is still active in any meaningful way? Distributed repositories which support Apache Subversion (SVN) might also feed into systems such as CrystalWorks in a similar way to the COD.
  
Henry Rzepa says:

August 10, 2017 at 9:08 pm

Bob,

If you look at the types of search I discuss on this blog, most of them are not possible on the COD unless a direct SQL search is constructed. I had a discussion about these searches once with someone from the COD, and that was their advice, as was the use of R to display the statistics. All this had to be done at the time by creating an ssh session onto the COD system directly and issuing terminal commands, after a request to create an account there. The search interface at COD does not allow the search described here.

You query “to what extent they are fully used by the average user is a moot point”? Well, to some extent, my purpose on this blog is to highlight how simple chemistry CAN be retrieved using more advanced searches, and I make the point that armed with the search query itself (which I tend to make available via the data repository), such searches can be replicated by anyone in just a minute or so. Three such “advanced” searches designed as student lab experiments are described in this J. Chem. Ed. article (DOI: 10.1021/acs.jchemed.5b00346), the intention being to train students to be average users, or indeed beyond! I concluded that training students to learn the SQL language to perform such searches was indeed a bit too ambitious.

I would raise another point regarding “curation”. The CSD is heavily curated. Part of this process is the difficult task of assigning atom and bond types to each molecule to enable atom and bond-based advanced searches. I have made the point a number of times on this blog that advanced searches can reveal errors (“outliers”) in e.g. bond assignments (here is one example, see the comments section and another and the most recent). I would love to be able to replicate such searches exactly on the COD to compare the level of atom and bond curation on that database with that on the CSD. Are the bond types assigned more accurately on COD than on CSD? It would be fantastic to know the answer to this, but to do so, I will have to learn how to peek at a MySQL database directly, something I have not so far learned how to do. So if someone better versed in searching the COD database can perform the three examples noted earlier on COD itself and report on the quality of the results, I think we would all be very grateful!

Henry Rzepa says:

August 10, 2017 at 9:52 pm

Bob, Re: “In your original Blog you discuss the Imperial College Research Data Repository. I am sure there are many others. These could (should?), of course, automatically feed into the COD system.”

One advantage of registering a DOI for crystal structures is that it enables searches such as
```
https://search.datacite.org/works?query=media:chemical\/x\-cif*
```
This retrieves at the moment 37 entries, which can be returned in the form of a JSON array. A query such as http://data.datacite.org/chemical/x-cif/10.14469/hpc/2752 constructed using the JSON data will then download any individual entry.

This might be a programmatic way in which automatic detection and feeding into COD of crystal structures found in data repositories might work. Note that this method does not require any knowledge of the repository API, it is done entirely using registered metadata. I would add however that this does require the repository to take advantage of the metadata schema, which as far as I am aware only ours does at the moment.
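
As a sketch of how the second step might be scripted, the snippet below builds the download URL for each DOI found in a set of search results. The URL pattern is taken directly from the example quoted above; the shape of the JSON (a list of objects each carrying a `doi` key) is an assumption made for illustration, as are the function names. The sketch makes no network requests, so it does not check whether the service still responds at that address.

```python
from urllib.parse import quote

# Host and path layout copied from the example URL quoted above.
CONTENT_RESOLVER = "http://data.datacite.org"


def cif_url(doi, media_type="chemical/x-cif"):
    """Build the URL that should return the CIF registered for one DOI."""
    return f"{CONTENT_RESOLVER}/{media_type}/{quote(doi, safe='/')}"


def cif_urls_from_results(results):
    """Map search results to download URLs.

    Assumes (hypothetically) that each result is a dict with a "doi" key.
    """
    return [cif_url(item["doi"]) for item in results if "doi" in item]
```

For example, `cif_url("10.14469/hpc/2752")` reproduces the URL given earlier, and a harvesting script would then simply loop over the list returned by `cif_urls_from_results`.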

Henry Rzepa says:

September 28, 2017 at 11:23 am

More on the topics raised by Bob McMeeking.

The benefit of any collection of data about molecules depends on the level of curation undertaken on a periodic or regular basis. In this instance, that means how e.g. bond types are assigned to any atom pair in the molecules, or indeed whether they are assigned at all.

The process involved in curation of the CSD, including this aspect, has been very recently described at DOI: 10.1107/S2052520616003954. Apart from a one-off manual curation at the point of deposition, CCDC tell me that annual CSD ongoing improvement projects are undertaken, typically improving nearly 10% of the existing entries in the CSD, and described here for 2017.

The curation process for COD can be found at DOI: 10.1093/nar/gkr900.
Reading the latter, in the section headed “Manual data curation”, there is nothing explicit stated about the processes involved in assignment of bond types and atom-pair connectivity.

It would indeed be interesting to compare as directly as possible the same structures on both databases, particularly those amongst the 10% influenced by e.g. the annual CSD curation activities, to see how the assigned bond types in particular compare.

Henry Rzepa's Blog