Quantum chemistry interoperability (library): another step towards FAIR data.

To be FAIR, data has to be not only Findable and Accessible, but also straightforwardly Interoperable. One of the best examples of interoperability in chemistry comes from the domain of quantum chemistry, which strives to describe a molecule by its electron density distribution, from which many interesting properties can then be computed. The process is split into two parts:

  1. Computation of the wavefunction. This can be a very compute-intensive process, taking quite a few days even when using 64 or more processors in parallel, and it requires highly specialised programs.
  2. Analysis of the wavefunction. The range of properties that can be computed is impressively large, but again this requires specialised algorithms and programs.

So one can see that the need to Interoperate wavefunction data computed during process 1 into analysis in process 2 is crucial. This is normally achieved using intermediate data files, and clearly the semantics of the data in these files must be perfectly communicated between the two processes.

With this introduction over, my attention was drawn to a recent post on the CCL (Computational Chemistry List, http://www.ccl.net), a venerable resource that has been running for many decades and where many aspects of computational chemistry are discussed. One recent thread relates to quantum chemistry interoperability (http://www.ccl.net/cgi-bin/ccl/day-index.cgi?2021+12+30), where many interesting points were made. I highlight just a few here (but urge you to read the entire thread).

  1. The first, by Mike Frisch (http://www.ccl.net/cgi-bin/ccl/message-new?2021+12+30+003), introduces two interoperability formats (the binary array file formats) along with a library of routines in both Fortran and Python which facilitate interoperability between programs that calculate wavefunctions and programs that perform post-processing analysis. The advantages include “Like the fchk file, this is a self-defining file, but it is binary so that full precision can be retained and reading/writing the file is much faster”. The format is described at https://gaussian.com/interfacing/. Output in this format is controlled by the keyword Output=MatrixElement or by the use of environment variables. As a long-time user of an older interoperability mechanism, the so-called WFN and WFX formats for use with programs such as AIMALL and Multiwfn, I have often set this keyword to e.g. Output=wfn. When generated, such files are routinely included in our FAIR data publications, which are often mentioned both in this blog and in the journal articles we write. If you read the post by Mike, you will understand both the deficiencies of these earlier formats and how the binary array file is an important advance.
    • I make one “user interface plea” here in the hope that Gaussian might be able to do something about it. By default, the output keyword is not set and so no wavefunction data is produced other than a binary .CHK file. This in turn requires an extra step to convert it into the interoperable non-binary .FCHK file. When I need a WFN file, I very often forget to set the output keyword and have to re-run the program to obtain it. So my plea is to consider setting the program defaults to write out some form of the binary array file when the job completes. There are additional flags that can be set for specialised applications, but if a sensible default is practical, it would be good to have.
  2. The second email is a response to Mike’s post by Tian Lu, who is well known for his amazing “swiss army knife” program Multiwfn, which can compute a large variety of molecular properties using wavefunction files. He had in fact proposed his own interoperability format, called MWFN (documented here[1]), to eliminate many of the recognised issues with the older WFN, FCHK and WFX formats. Currently this particular format is not yet widely supported by wavefunction-computing programs such as Gaussian, but perhaps Output=mwfn will come one day!
  3. A third, later email describes the TREXIO project (https://trex-coe.github.io/trexio/ and specifically https://trex-coe.github.io/trexio/trex.html), in which a metadata group is specifically identified because “we need to give the possibility to the users to store some metadata inside the files.” In fact, metadata is also useful for registration with metadata agencies.
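
The precision argument in the quote from Mike's post can be illustrated with a small shell sketch. It is only an illustration under stated assumptions: the value is made up, and the eight-decimal text field is an assumed stand-in for a formatted text file, not something read from an actual Gaussian file.

```shell
#!/bin/sh
# Illustration only: compares a number printed in a short text field with the
# same number printed at roughly the precision of a 64-bit binary value.
# The value and the 8-decimal field width are assumptions for this sketch.
LC_ALL=C; export LC_ALL              # force '.' as the decimal separator

value=0.123456789012345              # hypothetical matrix element

as_text=$(printf '%.8E' "$value")    # digits kept by an 8-decimal text field
as_full=$(printf '%.15E' "$value")   # digits a 64-bit value can carry

echo "text field : $as_text"
echo "full value : $as_full"
```

The two printed lines differ in their trailing digits, which is the information a text round trip of this width would discard and a binary file would keep.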

This increasing discussion of Interoperability in Quantum Chemistry has to be warmly welcomed. It directly feeds into FAIR data and may even set a trend for other areas of chemistry, such as NMR spectroscopy!


I have now learnt that the default output I pleaded for above can be obtained by inserting one of the following environment variables into job scripts:

export GAUSS_OMDEF=fortranbinaryarray.faf
or
export GAUSS_ORDEF=rawbinaryarray.baf

The proposed media types for these files are chemical/x-rawbinaryarray (.baf) and chemical/x-fortranwbinaryarray (.faf).

Writing both files from the same job is currently not supported (G16 C.01), so the second file can be generated from the .chk file by appending one of the following post-processing commands to the job script:

formchk -raw mychk.chk rawbinaryarray.baf
or
formchk -mat mychk.chk fortranbinaryarray.faf
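
Putting the two steps together, a job script might look like the sketch below. This is an assumption-laden outline rather than a tested recipe: the input and log file names (myjob.gjf, myjob.log) and the g16 invocation are hypothetical and depend on the local installation, while the environment variable and the formchk command are copied from the lines above.

```shell
#!/bin/sh
# Sketch of a job script. File names and the g16 call are assumptions;
# GAUSS_OMDEF and "formchk -raw" are taken from the text above.
chk=mychk.chk                   # checkpoint file named in the Gaussian input
faf=fortranbinaryarray.faf      # written by the job itself
baf=rawbinaryarray.baf          # derived from the checkpoint afterwards

# Ask Gaussian to write the Fortran binary array file when the job completes.
export GAUSS_OMDEF="$faf"

run_job() {
    # Assumed invocation; adapt to how Gaussian is launched locally.
    g16 < myjob.gjf > myjob.log
    # Only one binary array file can be written per job, so the second
    # one is generated from the checkpoint file as a post-processing step.
    formchk -raw "$chk" "$baf"
}

# Run only where Gaussian is actually available.
if command -v g16 >/dev/null 2>&1 && command -v formchk >/dev/null 2>&1; then
    run_job
else
    echo "g16 or formchk not found; nothing was run" >&2
fi
```

Swapping the roles (GAUSS_ORDEF in the job and formchk -mat afterwards) would give the same pair of files in the other order.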


This post has DOI: 10.14469/hpc/10043


References

  1. T. Lu and Q. Chen, "mwfn: A Strict, Concise and Extensible Format for Electronic Wavefunction Storage and Exchange", 2021. http://dx.doi.org/10.26434/chemrxiv-2021-lt04f-v5
