Data galore! 134 kilomolecules.

I do go on a lot about the importance of having modern access to data, and so the appearance of this article[1] immediately struck me as important. It is, appropriately enough, in the new journal Scientific Data. The data comprise computed properties at the B3LYP/6-31G(2df,p) level for 133,885 species with up to nine heavy atoms, and the entire data set has its own DOI[2]. The data were generated by subjecting a molecule set to a number of validation protocols, including obtaining relaxed (optimised) geometries at the B3LYP/6-31G(2df,p) level. It would be good to replicate this set with a functional that also includes dispersion, and of course making the coordinates all available in this manner greatly facilitates this. The collection also includes data for e.g. 6095 constitutional isomers of C7H10O2, which reminds me of an early, delightfully entitled, article adopting such an approach in quantum chemistry[3]. Such collections are an important part of the process of validating computational methods[4]. This way of publishing data does raise some interesting discussion points.

  1. In this case, we have coordinates for 134 kilo molecules, but the individual molecules in this collection do not have formalised metadata. The InChI key is an example of such metadata; it means that each entry can be specifically searched. Where you have a monolithic collection of 134k molecules, no such structured, exposed metadata exists for the individual entries, and you will have to generate it yourself in order to search it (a sketch of doing so follows this list).
  2. Each of the molecules in this collection is revealed (once you have downloaded the compressed archive as above and decompressed it into a 548 Mbyte folder‡) as a separate XYZ file. This syntax has the merit of being very simple, and can easily be processed by a human. Computed molecular properties in the form of metadata are, however, missing from this particular (relatively ancient) format; to recover them, you would have to repeat the calculation.
  3. In fact the XYZ files in this example do seem to have some (unformalised) properties appended to the bottom of the file (the SMILES and InChI strings are recognizably there), as shown in the example below:
    27
    gdb 57483   2.68237 1.10148 0.98017 0.0557  94.95   -0.2958 0.073 ...
    C   -0.0805964233    1.5844710741    0.1983967506   -0.41097
    .........
    29.7376 87.1304 196.1576    216.856 ...
    CC(C)(C)C1CCCC1 CC(C)(C)C1CCCC1 
    InChI=1S/C9H18/c1-9(2,3)8-6-4-5-7-8/h8H,4-7H2,1-3H3 InChI=1S/C9H18/c1-9(2,3)8-6-4-5-7-8/h8H,4-7H2,1-3H3
    

    This of itself does raise some issues.

    1. The title line (starting gdb) has extra numbers, but it is not immediately obvious what these are.
    2. The XYZ file is no longer standard because extra information is appended, both to each atom line (the charge? shown above as -0.41097) and to the bottom. Much software will not recognise this non-standard XYZ file, and is likely to discard the additional information. Thus I tried wxMacMolPlt (a long-time reader of XYZ files) with no success; human editing of the file was required to remove the additional information before a sensible molecule loaded. Only at this point could one progress to (re)compute the molecular properties.
    3. The extra information is not formally described. As a human† I can recognise it as an atom coordinate list with appended charges (I think), to which is appended a list of normal coordinate harmonic wavenumbers in units of cm-1, a SMILES and InChI as separate lines. That is really informed guesswork (a human is very good at such pattern recognition), but I cannot be absolutely certain, and a machine seeing this for the first time would certainly struggle.
    4. The last lines contain repetitions of the SMILES and InChI strings. I am guessing that this is the connectivity determined before and after geometry optimisation (using quantum mechanics, bonds can indeed break or form during such a process), but I may be quite wrong about that. I have not tried to resolve this issue by actually reading the depths of the article, since the file itself really should carry such information.
    5. The XYZ file itself carries no provenance, such as who created the file, which software and version was used to create it, the date of creation etc.
  4. An alternative approach is the one adopted here on this blog. Each individual molecule is assigned a DOI and its own metadata and provenance, and is presented to the user in a variety of syntactical forms, each designed for a different purpose. Thus the syntax and semantics of a CML file are clearly defined by a schema, and this format can easily absorb additional information without “breaking the standard”. It too can be scaled to 134 kilo molecules[4], although this does require a suitable container (repository) to handle that scale (and I am not entirely sure that DataCite would approve of the generation of 134 kiloDOIs).
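
To make points 1 and 3 concrete, here is a minimal Python sketch of generating your own searchable metadata from one of these files. The layout is assumed from the fragment shown above (atom count, property line, atom/charge lines, then frequencies, SMILES and InChI as trailing lines); the InChIKey step assumes the RDKit toolkit is available, and the "*^" normalisation guards against Fortran-style exponents that reportedly appear in a few entries.

    def parse_extended_xyz(path):
        """Parse one of the 134k extended XYZ files into a structured record."""
        with open(path) as f:
            lines = f.read().splitlines()
        natoms = int(lines[0].split()[0])
        scalars = lines[1].split()                    # 'gdb', an index, then the scalar properties
        atoms, charges = [], []
        for line in lines[2:2 + natoms]:
            fields = line.replace("*^", "e").split()  # normalise Fortran-style exponents
            atoms.append((fields[0], float(fields[1]), float(fields[2]), float(fields[3])))
            charges.append(float(fields[4]))          # the non-standard fifth column
        freqs = [float(x) for x in lines[2 + natoms].replace("*^", "e").split()]
        return {"scalars": scalars, "atoms": atoms, "charges": charges,
                "frequencies": freqs,                 # harmonic wavenumbers, cm-1
                "smiles": lines[3 + natoms].split(),
                "inchi": lines[4 + natoms].split()}

    def inchikey(inchi_string):
        """Hash an InChI into a fixed-length, exactly searchable InChIKey (assumes RDKit)."""
        from rdkit import Chem
        return Chem.InchiToInchiKey(inchi_string)

Run over all 133,885 files, the InChIKeys would form precisely the sort of structured, per-entry index whose absence point 1 laments.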

Overall, this sort of data publication must be warmly welcomed by the community, and I do hope that more chemistry data is routinely made available in an appropriate manner. The presentation in ready-to-reuse form will no doubt improve as the value of such data becomes more fully appreciated. And ultimately, humans need to be excluded from much of this process (editing the 133,885 sets of XYZ coordinates as described above is not for humans to do).


‡Your computer however might balk at opening a folder with 133,885 items in it. Try this only on a very fast machine with lots of memory and ideally an SSD!

†Contrary to some rumors, I do not hail from the planet Zog.

References

  1. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", Scientific Data, vol. 1, 2014. http://dx.doi.org/10.1038/sdata.2014.22
  2. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", figshare, 2015. http://dx.doi.org/10.6084/m9.figshare.978904
  3. P.P. Bera, K.W. Sattelmeyer, M. Saunders, H.F. Schaefer, and P.V.R. Schleyer, "Mindless Chemistry", The Journal of Physical Chemistry A, vol. 110, pp. 4287-4290, 2006. http://dx.doi.org/10.1021/jp057107z
  4. P. Murray-Rust, H.S. Rzepa, J.J.P. Stewart, and Y. Zhang, "A global resource for computational chemistry", Journal of Molecular Modeling, vol. 11, pp. 532-541, 2005. http://dx.doi.org/10.1007/s00894-005-0278-1


5 Responses to “Data galore! 134 kilomolecules.”

  1. Raghunathan Ramakrishnan says:

    Dear Prof. Rzepa,

    Thank you for the blog post. I just want to clarify the “issues raised” in it:

    “1. The title line (starting gdb) has extra numbers, but it is not immediately obvious what these are.”
    — This is mentioned in the paper: “Now, the comment line is used to store all scalar properties.” These properties are the rotational constants, dipole moment, polarizability, energies of the HOMO and LUMO, radial expectation value, and thermochemical energetics. The order in which these scalar properties are listed is given in Table 2 of the paper.

    “2. The XYZ file is no longer standard because extra information is appended, both to each atom line (the charge? shown above as -0.41097) and to the bottom. Much software will not recognise this non-standard XYZ file,”
    — I tried a few programs while preparing the XYZ files. Avogadro, one of the most widely used open-source packages, displays the molecule correctly.

    “3. The extra information is not formally described. As a human† I can recognise it as an atom coordinate list with appended charges (I think), to which is appended a list of normal coordinate harmonic wavenumbers in units of cm-1, a SMILES and InChI as separate lines. That is really informed guesswork ”
    — Table 2 summarizes the content of a single XYZ file. Furthermore, below Table 3, we say “Mulliken charges are added as a fifth column. Harmonic vibrational frequencies, SMILES and InChI are appended as respective additional lines.”

    “4. The last lines contain repetitions of the SMILES and InChI strings. I am guessing that this is the connectivity determined before and after geometry optimisation (using quantum mechanics, bonds can indeed break or form during such a process) but I may be quite wrong about that.”
    — The SMILES and InChI strings correspond to the Cartesian coordinates displayed in the XYZ file. The change in connectivities before and after geometry optimization is also discussed in the paper, under the subsection “Validation of geometry consistency”.

    “5. The XYZ file itself carries no provenance, such as who created the file, ”
    — The files were created by the authors of the paper, using linux utilities such as cat, grep, more, and tail, invoked from perl, awk, sed, and/or bash scripts. The chemistry software used is mentioned in the paper at the appropriate places.

    “‡Your computer however might balk at opening a folder with 133,885 items in it. Try this only on a very fast machine with lots of memory and ideally an SSD!”
    — I would recommend exploring the zip without extracting it, using linux utilities such as zgrep, zcat, zmore, zless and so on.
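
    For those who prefer Python to the shell, a rough equivalent that inspects entries without writing 133,885 files to disk (the archive name below is a placeholder for whichever file you downloaded; if the deposit is a tar or bzip2 archive rather than a zip, the tarfile or bz2 modules serve the same purpose):

        import zipfile

        with zipfile.ZipFile("134k_molecules.zip") as archive:   # placeholder name
            for name in archive.namelist():
                text = archive.read(name).decode("utf-8", errors="replace")
                if "InChI=1S/C9H18" in text:                     # crude grep for one formula
                    print(name)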

    Finally, I do agree with your view that we need standard conventions for storing molecular properties, to facilitate their use with less effort. In the “134k molecules” paper, we tried to make things extractable using standard linux utilities or “one-liner” bash scripts.

    Best,
    Raghu

  2. Anonymous says:

    The readme.txt file associated with this dataset contains plenty of information about what data are included in the xyz files, the units, and other details. No guesswork needed.

    http://files.figshare.com/1535303/readme.txt

  3. Henry Rzepa says:

    In reply to “anonymous”

    The readme.txt file is a traditional file for collecting information for interpretation only by a human reader. It is very unlikely that any software would be able to extract much meaning from such a file. And I made the point that only software can handle data at the scale it is available here.

    In a semantic system, one really should not use a readme file to add semantics.

  4. Henry Rzepa says:

    In reply to Raghunathan Ramakrishnan,

    Thanks for the very interesting reply. Much appreciated.

    1. I appreciate that much is mentioned in the paper, but divorcing the semantics (as described in the paper) from the data (as held in an XYZ file) is a less robust way of handling such semantics. The values in the XYZ file are also unitless, and re-associating each value with its unit again requires a human to make the connection. I would also add that storing such (semantically) important information in a comment line (which is designed largely for humans) is less effective at exposing those semantics to machines.

    2. I tried the latest version of MacMolPlt and it refused to read the file until I removed the fifth column following the XYZ coordinates. Possibly Avogadro is less fussy and simply reads only the first four columns. In general, how a program might react to “unexpected data” is quite unpredictable (this behaviour is probably not documented anywhere, and is declared only in the source code). I might add that whilst no-one would have a problem persuading a text editor to remove a line from a file, persuading it to select and delete a column is a more interesting challenge. That is how I did it (using e.g. BBEdit), but I suspect many people might simply do it line by line; a programmatic sketch follows this list.

    3. I don't think it is in general robust to “redefine” a standard ad hoc (in this case an accepted rather than a formal standard) by adding e.g. a column to it. As we see with e.g. MacMolPlt, how a program might handle such extensions is quite unpredictable.

    4. See 1.

    5. I did say that the file itself should carry provenance. It is obviously inherited in this case by reference to “the paper”, but that association is firstly very easily lost, and secondly only interpretable by a human.
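
    A minimal sketch of the programmatic alternative to the hand editing mentioned in point 2: keep the atom count, the comment line, and only the first four columns of each atom line, producing a conventional XYZ file that fussier readers should accept.

        def to_standard_xyz(path_in, path_out):
            """Strip the extra charge column and trailing lines from an extended XYZ file."""
            with open(path_in) as f:
                lines = f.read().splitlines()
            natoms = int(lines[0].split()[0])
            out = [lines[0], lines[1]]                   # atom count and comment line
            for line in lines[2:2 + natoms]:
                out.append(" ".join(line.split()[:4]))   # element, x, y, z only
            with open(path_out, "w") as f:
                f.write("\n".join(out) + "\n")           # frequency/SMILES/InChI lines discarded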

    I think the overall point is that we used to create data purely for consumption by fellow humans. Nowadays the data is created under instruction by humans, but its consumption still requires the human to put much effort into recognising the semantics and syntax of the data. We should strive to remove as much as possible of this latter activity, so that data is immediately fit for use by machines/software, with as little help from humans as possible.

    I will also append to this comment the link https://www.repository.cam.ac.uk/handle/1810/724 This dates from 2005, and contains 175 kilo molecules, each with its own identifier (e.g. http://www.dspace.cam.ac.uk/handle/1810/183615) and each carrying its semantics within the file itself using XML. In this case the molecules were optimised using semi-empirical theory rather than DFT, but the principle is the same. The XML, dating from 2005, is not up to the standards expected today (thus you will not find a declaration of e.g. the semi-empirical origins of the data), but it does show the way forward. See ref. 4 in my post.
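
    To illustrate carrying semantics within the file itself, here is a schematic sketch (Python emitting CML-flavoured XML) of a property stored with a dictionary reference and explicit units. Treat it as an illustration rather than validated CML, and note that reading 0.0557 as the dipole moment of molecule 57483 is my own inference from the property line shown in the post.

        import xml.etree.ElementTree as ET

        molecule = ET.Element("molecule", id="gdb_57483")
        atoms = ET.SubElement(molecule, "atomArray")
        ET.SubElement(atoms, "atom", elementType="C",    # first atom of the sample file
                      x3="-0.0805964233", y3="1.5844710741", z3="0.1983967506")
        prop = ET.SubElement(molecule, "property", dictRef="cc:dipoleMoment")
        ET.SubElement(prop, "scalar", units="nonSi:debye").text = "0.0557"
        print(ET.tostring(molecule, encoding="unicode"))

    Nothing here needs a readme: the units and the meaning of each value travel with the data.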

  5. […] work has already caught attention. For instance, Henry S. Rzepa in his post “Data galore! 134 kilomolecules” recognizes the need for such data sets, but also raises important issues of data formats that must […]
