Towards better chemical databases for atomistic machine learning

Date:

Machine learning (ML) has revolutionized the field of atomistic simulations. It is now possible to obtain high-quality predictions of chemical properties such as total energies, forces, or dipole moments at low computational cost. Gas-phase atomistic simulations on the sub-microsecond time scale at ab initio MP2 quality can now be carried out routinely. Given that the computational effort to evaluate such a statistical model is independent of the quality of the input data, the most significant bottleneck for devising better ML models is the considerable amount of data required to train them. Although the community consensus is that more data naturally leads to better performance, it has been found that this working hypothesis does not necessarily hold for predicting chemical properties with models trained on commonly used databases such as QM9 or ANI-1. Consequently, there is a need to identify how to obtain suitable training data for ML models and, for established databases, how to add or remove information while retaining the best model performance.

In this contribution, we will discuss the use of uncertainty quantification (UQ) methods for atomistic neural networks to identify cases in which additional information is required or in which redundant information complicates the prediction of a specific property. Furthermore, we will discuss the performance of different data augmentation (DA) methods, such as sampling from conformational space, the use of Atom-in-Molecule (AMONS) fragments, and generative models (graph- and diffusion-based). Combining UQ and DA methods sets the stage for a workflow that yields more robust and data-efficient chemical databases while retaining prediction accuracy.
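
As a rough illustration of how UQ can flag structures for which a database lacks information, the sketch below uses the disagreement within an ensemble of models as an uncertainty estimate. This is an assumption made for illustration only: the abstract does not state which UQ method is used, and the function names, the energy values, and the threshold of 1.0 kcal/mol are all hypothetical.

```python
from statistics import mean, pstdev


def ensemble_uncertainty(predictions):
    """Return per-structure mean and spread of an ensemble's predictions.

    predictions[m][i] is the prediction of ensemble member m for structure i.
    The spread (population standard deviation across members) serves as a
    simple, assumed proxy for the model's uncertainty.
    """
    per_structure = list(zip(*predictions))  # regroup by structure
    means = [mean(values) for values in per_structure]
    spreads = [pstdev(values) for values in per_structure]
    return means, spreads


def flag_uncertain(spreads, threshold):
    """Indices of structures whose ensemble spread exceeds the threshold."""
    return [i for i, s in enumerate(spreads) if s > threshold]


# Hypothetical energies (kcal/mol) from three ensemble members
# for four structures; the numbers are made up for this example.
energies = [
    [-10.0, -20.0, -30.0, -40.0],
    [-10.1, -20.0, -33.0, -40.1],
    [-9.9, -20.0, -27.0, -39.9],
]

means, spreads = ensemble_uncertainty(energies)
candidates = flag_uncertain(spreads, threshold=1.0)
print(candidates)
```

Structures returned by `flag_uncertain` would be candidates for targeted data augmentation, while structures with consistently small spread are candidates for a redundancy check. In practice the threshold would need to be calibrated against reference errors.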

See a summary and the presented slides at: Here