Towards better chemical databases for atomistic machine learning
Talk, ACS Fall: Harnessing the Power of Data, San Francisco, California, United States
Machine learning (ML) has revolutionized the field of atomistic simulations. It is now possible to obtain high-quality predictions of chemical properties such as total energies, forces, or dipoles at low computational cost. The field has reached a stage at which gas-phase atomistic simulations on the sub-microsecond time scale can be carried out routinely at ab initio MP2 quality. Given that the computational effort to evaluate such a statistical model is independent of the quality of the input data, the most significant bottleneck for devising yet better ML models is the considerable amount of data required to train them. Although the community consensus is that more data naturally leads to better performance, it has been found that this working hypothesis is not necessarily correct for predicting chemical properties with models trained on commonly used databases such as QM9 or ANI-1. Consequently, there is a need to identify how to obtain suitable training data for ML models and, for established databases, how to add or remove information while retaining the best model performance.